In [944]:
import os
from time import time

import numpy as np
import pandas as pd
import scipy
import scipy.stats as stats
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     cross_validate, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

Regression Analysis of Human Breath Rate and Other Physical Parameters

ITB8814 Data Mining, Project

Author: Juri Lunin

Date: 17.05.2024

Introduction

This project studies the dataset "Energy Expenditure of Human Physical Activity".

Dataset description

The dataset contains people's physical characteristics and information about their condition during physical activity. Dataset: Energy Expenditure of Human Physical Activity. File format: CSV.

Two scientific articles are published together with the dataset:

  • Activity recognition using wearable sensors for tracking the elderly
  • A recurrent neural network architecture to model physical activity energy expenditure in older people

Dataset features:

  • ID - participant's ID
  • trial_date - date and time when data collection started at ID level
  • gender - sex = male or female
  • age - in years
  • weight - in kg
  • height - in cm
  • bmi - Body mass index in kg/m²
  • gaAnkle - TRUE if data from GENEActiv on the ankle exist, FALSE otherwise
  • gaChest - TRUE if data from GENEActiv on the chest exist, FALSE otherwise
  • gaWrist - TRUE if data from GENEActiv on the wrist exist, FALSE otherwise
  • equivital - TRUE if data from Equivital exist, FALSE otherwise
  • cosmed - TRUE if data from COSMED exist, FALSE otherwise
  • EEm - Energy Expenditure per minute, in Kcal
  • COSMEDset_row - the original indexes of COSMED data (used for merging)
  • EEh - Energy Expenditure per hour, in Kcal
  • EEtot - Total Kcal spent (it is reset between indoor and outdoor measurements)
  • METS - Metabolic Equivalent per minute
  • Rf - Respiratory Frequency (breaths/min)
  • BR - Breath Rate
  • VT - Tidal Volume in litre
  • VE - Expiratory Minute Ventilation (litre/min)
  • VO2 - Oxygen Uptake (ml/min)
  • VCO2 - Carbon Dioxide production (ml/min)
  • O2exp - Volume of O2 expired (ml/min)
  • CO2exp - Volume of CO2 expired (ml/min)
  • FeO2 - Averaged expiratory concentration of O2 (%)
  • FeCO2 - Averaged expiratory concentration of CO2 (%)
  • FiO2 - Fraction of inspired O2 (%)
  • FiCO2 - Fraction of inspired CO2 (%)
  • VE.VO2 - Ventilatory equivalent for O2
  • VE.VCO2 - Ventilatory equivalent for CO2
  • R - Respiratory Quotient
  • Ti - Duration of Inspiration (seconds)
  • Te - Duration of Expiration (seconds)
  • Ttot - Duration of Total breathing cycle (seconds)
  • VO2.HR - Oxygen pulse (ml/beat)
  • HR - Heart Rate
  • Qt - Cardiac output (litre)
  • SV - Stroke volume (litre/min)
  • original_activity_labels - True activity label as noted from study protocol, NA if is unknown
  • predicted_activity_label - Predicted activity label by model from [1], NA if is unknown

Research objective

The predicted feature, i.e. the target feature, is Y = "BR" (breath rate).

Objective: to find the best multiple regression model for predicting the target feature BR.

Tasks (working hypotheses)

  1. People with a lower breath rate are in better health.

Methodology and course of the study

This project uses the following analysis methods:

  • Linear regression
  • Polynomial regression
  • Decision tree regression model
  • Random forest regression model

Contents

  • Data analysis
  • Reading the data
  • Overview of the dataset structure
  • Data cleaning
  • Feature transformation
  • Feature description
  • Relationship analysis
  • Linear regression
  • Polynomial regression
  • Decision tree regression
  • Random forest regression
  • Summary
In [945]:
_DATA_PATH = 'data/EEHPA.csv'
_SIHTTUNNUS_ = 'BR'
_GOAL_ = 'Find the best multiple regression model for predicting the target feature BR.'
_DROP_= ['age', 'weight', 'height', 'bmi', 'ID',
         'gaAnkle', 'gaChest', 'gaWrist', 'equivital', 'cosmed', 'COSMEDset_row',
         'trial_date',
         'VE.VO2', 'VE.VCO2', 'R', 'FiO2', 'FiCO2',
         'VO2', 'VCO2', 'EEm', 'EEh',
         'Ti', 'Te', 'Ttot',
         'Qt', 'SV',
         'predicted_activity_label']
_OBJ_CAST_= []
_DROP_UNNAMED_ = True

The dataset has too many attributes that do not materially affect the results of the study. We remove them right away. Some features were identified as unimportant during the relationship analysis.

We drop 'ID'. It plays no role in the study.

We drop the boolean features that describe whether data from each device exist: 'gaAnkle', 'gaChest', 'gaWrist', 'equivital', 'cosmed'. They are of little importance and their values are mostly 'True'.

We also drop 'COSMEDset_row' and 'trial_date'. They play no role in the present study.

We drop 'FiO2', 'FiCO2' and 'Qt', 'SV'. The relationship analysis showed only an indeterminate relationship between them.

The relationship analysis also revealed the following weak and natural relationships, which show no strong correlation and are of little interest to the study. The Clustermap analysis further showed that they can interfere with the results:

We drop 'age', 'weight', 'height', 'bmi'. We drop 'VE.VO2', 'VE.VCO2', 'R'. They could be studied separately, but for now their correlation is too weak.

While building the linear regression model we found that the attributes 'Ti', 'Te', 'Ttot' have too strong an influence, which is natural, and that they are not linearly dependent on the other features. We drop 'Ti', 'Te', 'Ttot'.

We also drop 'VO2', 'VCO2', 'EEm', 'EEh'. The difference between their coefficients is too large.

We drop the attribute 'predicted_activity_label', because it originates from the construction of another model and may introduce noise.
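
The column removal described above can be sketched on a toy frame. The frame `toy` and the list `drop_list` are made up for illustration and are not part of the project data; the sketch only shows that `DataFrame.drop` with `errors="ignore"` skips names that are absent, which is one possible way to avoid checking each name by hand.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"ID": ["a", "b"], "BR": [90, 85], "VE": [8.8, 11.4]})

# Hypothetical drop list; 'Qt' is absent from the toy frame on purpose.
drop_list = ["ID", "Qt"]

# errors="ignore" skips names that are not present instead of raising KeyError.
reduced = toy.drop(columns=drop_list, errors="ignore")
print(list(reduced.columns))
```

Because `drop` returns a new frame here, `toy` itself stays unchanged.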

Data analysis

Reading the data

In [946]:
df = pd.read_csv(_DATA_PATH)
df
Out[946]:
ID trial_date gender age weight height bmi gaAnkle gaChest gaWrist ... R Ti Te Ttot VO2.HR HR Qt SV original_activity_labels predicted_activity_label
0 GOTOV05 08/02/2016 13:42 female 61.6000 68.6000 162 26.1000 True True True ... 0.9846 0.9300 1.8600 2.7900 2.1250 102 0 0 NaN sitting
1 GOTOV05 08/02/2016 13:42 female 61.6000 68.6000 162 26.1000 True True True ... 1.0035 1.2600 1.1800 2.4400 2.2403 103 0 0 NaN NaN
2 GOTOV05 08/02/2016 13:42 female 61.6000 68.6000 162 26.1000 True True True ... 1.0399 0.9700 1.6900 2.6600 4.2051 104 0 0 NaN NaN
3 GOTOV05 08/02/2016 13:42 female 61.6000 68.6000 162 26.1000 True True True ... 1.0635 0.9600 2.0400 3.0000 4.3329 106 0 0 lyingDownRight standing
4 GOTOV05 08/02/2016 13:42 female 61.6000 68.6000 162 26.1000 True True True ... 1.0307 1.1500 1.5700 2.7200 2.6086 106 0 0 lyingDownRight NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38718 GOTOV36 30/05/2016 10:41 female 81.3000 72.0000 167 25.8000 True True True ... 0.8145 1.0100 1.7400 2.7500 3.9375 77 0 0 NaN standing
38719 GOTOV36 30/05/2016 10:41 female 81.3000 72.0000 167 25.8000 True True True ... 0.7833 1.1100 1.9100 3.0200 3.1757 77 0 0 NaN NaN
38720 GOTOV36 30/05/2016 10:41 female 81.3000 72.0000 167 25.8000 True True True ... 0.7643 0.8300 1.4100 2.2400 7.3279 77 0 0 NaN NaN
38721 GOTOV36 30/05/2016 10:41 female 81.3000 72.0000 167 25.8000 True True True ... 0.7948 1.2500 2.7700 4.0200 4.4365 77 0 0 NaN NaN
38722 GOTOV36 30/05/2016 10:41 female 81.3000 72.0000 167 25.8000 True True True ... 0.7068 1.1300 1.2900 2.4200 4.7681 77 0 0 NaN NaN

38723 rows × 41 columns

Overview of the dataset structure

Dataset size:

In [947]:
print(f"The dataset has \033[1m{df.shape[0]}\033[0m rows described by \033[1m{df.shape[1]}\033[0m features.")
The dataset has 38723 rows described by 41 features.

Dataset variables:

In [948]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38723 entries, 0 to 38722
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        38723 non-null  object 
 1   trial_date                38723 non-null  object 
 2   gender                    38723 non-null  object 
 3   age                       38723 non-null  float64
 4   weight                    38723 non-null  float64
 5   height                    38723 non-null  int64  
 6   bmi                       38723 non-null  float64
 7   gaAnkle                   38723 non-null  bool   
 8   gaChest                   38723 non-null  bool   
 9   gaWrist                   38723 non-null  bool   
 10  equivital                 38723 non-null  bool   
 11  cosmed                    38723 non-null  bool   
 12  EEm                       38723 non-null  float64
 13  COSMEDset_row             38723 non-null  int64  
 14  EEh                       38723 non-null  float64
 15  EEtot                     38723 non-null  float64
 16  METS                      38723 non-null  float64
 17  Rf                        38723 non-null  float64
 18  BR                        38723 non-null  int64  
 19  VT                        38723 non-null  float64
 20  VE                        38723 non-null  float64
 21  VO2                       38723 non-null  float64
 22  VCO2                      38723 non-null  float64
 23  O2exp                     38723 non-null  float64
 24  CO2exp                    38723 non-null  float64
 25  FeO2                      38723 non-null  float64
 26  FeCO2                     38723 non-null  float64
 27  FiO2                      38723 non-null  float64
 28  FiCO2                     38723 non-null  float64
 29  VE.VO2                    38723 non-null  float64
 30  VE.VCO2                   38723 non-null  float64
 31  R                         38723 non-null  float64
 32  Ti                        38723 non-null  float64
 33  Te                        38723 non-null  float64
 34  Ttot                      38723 non-null  float64
 35  VO2.HR                    38723 non-null  float64
 36  HR                        38723 non-null  int64  
 37  Qt                        38723 non-null  int64  
 38  SV                        38723 non-null  int64  
 39  original_activity_labels  24452 non-null  object 
 40  predicted_activity_label  11395 non-null  object 
dtypes: bool(5), float64(25), int64(6), object(5)
memory usage: 10.8+ MB
In [949]:
print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
print_prop_obj_count = len(df.select_dtypes(include=object).columns)

print(f"The dataset has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")
The dataset has 36 numeric and 5 non-numeric variables
In [950]:
if len(_OBJ_CAST_) > 0:
    obj_cast_l = {}
    for i in _OBJ_CAST_:
        if i in df:
            obj_cast_l.update({i: str})
    
    df = df.astype(obj_cast_l)
        
    df.info()
In [951]:
if len(_OBJ_CAST_) > 0:
    
    print("Casting the following features to categorical:")
    print(f"\033[1m{[i for i in _OBJ_CAST_]}\033[0m")

    print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
    print_prop_obj_count = len(df.select_dtypes(include=object).columns)

    print(f"The dataset has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")

Data cleaning

Duplicate check

In [952]:
duplicates = df.duplicated(keep='first').sum()
print(f"Found {duplicates} duplicate rows in the dataset.")
Found 28 duplicate rows in the dataset.
In [953]:
df.drop_duplicates(keep='first', inplace=True)
duplicates_new = df.duplicated(keep='first').sum()
print(f"After cleaning, {duplicates_new} duplicate rows remain in the dataset.")
After cleaning, 0 duplicate rows remain in the dataset.

Missing values check

In [954]:
missing_values_validation = df.isna().sum()
missing_values_validation
Out[954]:
ID                              0
trial_date                      0
gender                          0
age                             0
weight                          0
height                          0
bmi                             0
gaAnkle                         0
gaChest                         0
gaWrist                         0
equivital                       0
cosmed                          0
EEm                             0
COSMEDset_row                   0
EEh                             0
EEtot                           0
METS                            0
Rf                              0
BR                              0
VT                              0
VE                              0
VO2                             0
VCO2                            0
O2exp                           0
CO2exp                          0
FeO2                            0
FeCO2                           0
FiO2                            0
FiCO2                           0
VE.VO2                          0
VE.VCO2                         0
R                               0
Ti                              0
Te                              0
Ttot                            0
VO2.HR                          0
HR                              0
Qt                              0
SV                              0
original_activity_labels    14263
predicted_activity_label    27328
dtype: int64

Missing values occur only in original_activity_labels (14263) and predicted_activity_label (27328). All other features are complete.
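
To see at a glance which columns hold missing values and how large their share is, the counts can be turned into fractions. This is a minimal sketch on a made-up frame `toy`; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete numeric column, one label column with gaps.
toy = pd.DataFrame({
    "BR": [90, 85, 84, 92],
    "label": ["sitting", np.nan, np.nan, "cycling"],
})

# isna() gives a boolean frame; its column mean is the fraction of missing values.
missing_share = toy.isna().mean()

# Keep only the columns that actually contain missing values.
incomplete = missing_share[missing_share > 0]
print(incomplete)
```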

In [955]:
print(f"The cleaned dataset has \033[1m{df.shape[0]}\033[0m rows described by \033[1m{df.shape[1]}\033[0m features.")
The cleaned dataset has 38695 rows described by 41 features.

Feature transformation

In [956]:
# Drop only those listed columns that are actually present in the frame.
prep_validated_drop_list = [col for col in _DROP_ if col in df.columns]
if prep_validated_drop_list:
    df.drop(columns=prep_validated_drop_list, inplace=True)
In [957]:
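if _DROP_UNNAMED_:
    # Boolean mask of columns whose name contains "unnamed" (e.g. a stray CSV index column).
    prep_empty_cols = df.columns.str.contains('unnamed', case=False)

    # any() is True only if at least one such column exists.
    if prep_empty_cols.any():
        df.drop(columns=df.columns[prep_empty_cols], inplace=True)

df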
Out[957]:
gender EEtot METS Rf BR VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR HR original_activity_labels
0 female 0.0000 0.9528 21.5054 92 0.4091 8.7978 73.1476 12.4463 17.8803 3.0424 2.1250 102 NaN
1 female 0.0097 1.0143 24.5902 90 0.4642 11.4144 85.4919 11.8342 18.4176 2.5495 2.2403 103 NaN
2 female 0.0940 1.9223 22.5564 84 0.8774 19.7902 159.3537 25.3007 18.1628 2.8837 4.2051 104 NaN
3 female 0.2080 2.0189 20.0000 85 0.9243 18.4858 164.5567 30.6057 17.8035 3.3113 4.3329 106 lyingDownRight
4 female 0.2703 1.2154 22.0588 89 0.5876 12.9624 107.3240 16.2210 18.2639 2.7604 2.6086 106 lyingDownRight
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38718 female 85.4218 1.2031 21.8182 90 0.5650 12.3281 102.0947 13.8710 18.0686 2.4549 3.9375 77 NaN
38719 female 85.4871 0.9703 19.8675 92 0.5314 10.5573 96.9951 11.8312 18.2534 2.2265 3.1757 77 NaN
38720 female 85.5828 2.2391 26.7857 87 0.6436 17.2386 110.4581 19.6846 17.1632 3.0586 7.3279 77 NaN
38721 female 85.7260 1.3556 14.9254 92 0.7017 10.4733 120.3514 22.2344 17.1512 3.1686 4.4365 77 NaN
38722 female 86.1708 1.4569 24.7934 89 0.6028 14.9449 109.3362 12.8511 18.1387 2.1320 4.7681 77 NaN

38695 rows × 14 columns

In [958]:
print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
print_prop_obj_count = len(df.select_dtypes(include=object).columns)

print(f"The dataset has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")
The dataset has 12 numeric and 2 non-numeric variables

The dataset can be used for modelling without further changes.

Feature description

Description of numeric features

Summary statistics of the numeric features:

In [959]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.describe().T
Out[959]:
count mean std min 25% 50% 75% max
EEtot 38695.00 73.08 52.11 0.00 34.37 63.59 100.59 291.44
METS 38695.00 2.85 1.81 0.00 1.47 2.39 3.87 15.23
Rf 38695.00 23.99 10.80 2.89 18.13 22.30 27.78 375.00
BR 38695.00 82.02 10.42 23.00 77.00 85.00 90.00 99.00
VT 38695.00 1.19 0.60 0.04 0.76 1.08 1.50 4.64
VE 38695.00 27.73 16.97 0.20 15.40 22.84 35.15 132.13
O2exp 38695.00 207.24 103.41 7.55 134.52 186.63 257.73 969.38
CO2exp 38695.00 39.00 23.83 0.00 21.42 33.96 51.19 167.76
FeO2 38695.00 17.49 0.75 12.74 17.05 17.51 17.94 22.43
FeCO2 38695.00 3.11 0.68 0.00 2.70 3.09 3.53 6.11
VO2.HR 38695.00 7.94 4.99 0.00 4.60 7.55 11.12 39.39
HR 38695.00 81.68 36.18 0.00 69.00 83.00 104.00 203.00

We visualize the distributions of the numeric variables. To do so, we first select the numeric variables from the dataset:

In [960]:
feature_columns=df.drop(_SIHTTUNNUS_, axis=1).select_dtypes(exclude=object).columns
feature_columns
Out[960]:
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
       'VO2.HR', 'HR'],
      dtype='object')

Visualization of the numeric feature distributions: histogram + box plot

In [961]:
fig, axs = plt.subplots(len(feature_columns), 2, dpi=95, figsize=(15, 30))
for i, col in enumerate(feature_columns):
    # Left column: histogram; right column: horizontal box plot of the same feature.
    df[col].plot(kind='hist', ax=axs[i, 0], title=col, color="steelblue")
    df[col].plot(kind='box', vert=False, ax=axs[i, 1], title=col,
                 patch_artist=True,
                 boxprops=dict(facecolor="steelblue"),
                 medianprops=dict(color="red", linewidth=1.5)).set_yticklabels('')
fig.tight_layout()
plt.show()
[Figure: histogram (left) and box plot (right) for each numeric feature]

Most numeric features are right-skewed. The box plots show that outliers are plentiful.
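
The visual impression of many outliers can be backed by a count. The sketch below assumes the usual box plot convention, where a point counts as an outlier if it lies more than 1.5 interquartile ranges outside the quartiles. The helper `iqr_outlier_count` and the series `toy` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Made-up values with one obvious outlier (40).
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
print(iqr_outlier_count(toy))
```

Applied per column, for example with `df[feature_columns].apply(iqr_outlier_count)`, this would give one count per numeric feature.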

Description of non-numeric attributes

Summary statistics of the non-numeric features:

In [962]:
df.describe(include=[object]).T
Out[962]:
count unique top freq
gender 38695 2 male 24786
original_activity_labels 24432 16 cycling 4212

The most frequent gender value is male (24786 records) and the most frequent labelled activity is cycling (4212 records).

Frequency tables of the non-numeric feature values:

In [963]:
for column in df.select_dtypes(include=object).columns:
    print(column)
    print(df[column].value_counts().sort_index())
    print()
gender
gender
female    13909
male      24786
Name: count, dtype: int64

original_activity_labels
original_activity_labels
cycling            4212
dishwashing        1885
lyingDownLeft      1459
lyingDownRight     1307
sittingChair       1546
sittingCouch       1623
sittingSofa        1549
stakingShelves     1689
standing           1102
step                413
syncJumping         161
vacuumCleaning     1744
walkingFast        1997
walkingNormal      1883
walkingSlow        1694
walkingStairsUp     168
Name: count, dtype: int64

Cycling is the most frequent activity in this dataset.

Description of the target feature Y

In [964]:
print(f"Target feature: Y = \033[1m{_SIHTTUNNUS_}\033[0m.")
Target feature: Y = BR.

Visualization of the target feature distribution:

In [965]:
fig, axs = plt.subplots(1,2,dpi=95,figsize=(15,5))
df[_SIHTTUNNUS_].plot(kind='hist',ax=axs[0], title="{}".format(_SIHTTUNNUS_), color="steelblue")
df[_SIHTTUNNUS_].plot(kind='box',vert=False,ax=axs[1], title="{}".format(_SIHTTUNNUS_),
                                             patch_artist = True,
           boxprops = dict(facecolor = "steelblue"),
                                             medianprops = dict(color = "red", linewidth = 1.5))
plt.show()
[Figure: histogram and box plot of the target feature BR]

The histogram of the target feature is left-skewed. The models built below perform well enough that no transformation of the target feature is needed.
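
The direction of the asymmetry can also be checked numerically with the sample skewness, which is negative for a left-skewed distribution. This is a small sketch on made-up values in the range of BR, not on the project data.

```python
import pandas as pd

# Made-up values: most are high, a few low values form a long left tail.
toy_left = pd.Series([23, 60, 77, 85, 88, 90, 91, 92, 95, 99])

# Series.skew() returns the sample skewness; a negative value means a longer left tail.
skew = toy_left.skew()
print(f"skewness: {skew:.2f}")
```

On the real data the same call would be `df['BR'].skew()`.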

Relationship analysis

Relationships between numeric features

We visualize the dependencies between the numeric features.

In [976]:
sns.pairplot(df.select_dtypes(exclude=object))
plt.show()
[Figure: pair plot of the numeric features]

A linear dependence is clearly visible between the following features:

  • BR (Breath Rate) and VE (Expiratory Minute Ventilation (litre/min)): negative
  • VT (Tidal Volume in litre) and O2exp (Volume of O2 expired (ml/min)): positive
  • VT (Tidal Volume in litre) and CO2exp (Volume of CO2 expired (ml/min)): positive
  • O2exp (Volume of O2 expired (ml/min)) and CO2exp (Volume of CO2 expired (ml/min)): positive
  • FeO2 (Averaged expiratory concentration of O2 (%)) and FeCO2 (Averaged expiratory concentration of CO2 (%)): negative

The other feature pairs show a weak dependence, or a large scatter and a more complex dependence structure.

Correlation matrix

Correlation matrix of the numeric features:

In [ ]:
df.select_dtypes(exclude=object).corr()
Out[ ]:
EEtot METS Rf BR VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR HR
EEtot 1.00 0.46 0.18 -0.51 0.43 0.55 0.44 0.40 -0.02 0.16 0.35 0.32
METS 0.46 1.00 0.26 -0.91 0.73 0.90 0.71 0.77 -0.44 0.56 0.69 0.39
Rf 0.18 0.26 1.00 -0.35 -0.14 0.32 -0.13 -0.15 0.23 -0.19 0.10 0.14
BR -0.51 -0.91 -0.35 1.00 -0.73 -0.97 -0.74 -0.67 0.14 -0.31 -0.63 -0.47
VT 0.43 0.73 -0.14 -0.73 1.00 0.79 1.00 0.95 -0.32 0.47 0.68 0.44
VE 0.55 0.90 0.32 -0.97 0.79 1.00 0.80 0.73 -0.15 0.31 0.71 0.50
O2exp 0.44 0.71 -0.13 -0.74 1.00 0.80 1.00 0.92 -0.25 0.41 0.67 0.44
CO2exp 0.40 0.77 -0.15 -0.67 0.95 0.73 0.92 1.00 -0.51 0.67 0.68 0.39
FeO2 -0.02 -0.44 0.23 0.14 -0.32 -0.15 -0.25 -0.51 1.00 -0.91 -0.37 -0.01
FeCO2 0.16 0.56 -0.19 -0.31 0.47 0.31 0.41 0.67 -0.91 1.00 0.42 0.12
VO2.HR 0.35 0.69 0.10 -0.63 0.68 0.71 0.67 0.68 -0.37 0.42 1.00 0.58
HR 0.32 0.39 0.14 -0.47 0.44 0.50 0.44 0.39 -0.01 0.12 0.58 1.00

Heatmap visualization of the correlation matrix. We first select the numeric features from the dataset:

In [ ]:
num_f=df.select_dtypes(exclude=object)
In [ ]:
plt.figure(figsize=(16,16))
sns.heatmap(num_f.corr(), annot=True, fmt= '.2f')
plt.show()
[Figure: heatmap of the correlation matrix of the numeric features]

The same correlation matrix drawn with a diverging palette (diverging_palette):

In [ ]:
plt.figure(figsize=(16,16))
sns.set(font_scale=1.0)
hm = sns.heatmap(num_f.corr(), 
                 cbar=True, 
                 annot=True, 
                 square=True, 
                 fmt='.2f',
                 annot_kws={'size': 10}, 
                 yticklabels=num_f.columns,
                 xticklabels=num_f.columns,
                 cmap=sns.diverging_palette(10, 220, sep=30, n=256),
                 center=0.0)
plt.show()
[Figure: correlation heatmap with a diverging palette centered at 0]

The strongest negative relationships are between:

  • BR and VE (-0.97)
  • BR and METS (-0.91)
  • FeCO2 and FeO2 (-0.91)

There are also several strong positive relationships, for example VT and O2exp (1.00), VT and CO2exp (0.95) and METS and VE (0.90).

A clustermap orders the rows and columns so that columns with similar values are placed close to each other in the diagram. Reorganizing the correlation matrix obtained above in this way reveals groups of features with strong mutual relationships, which allows conclusions to be drawn about multicollinearity.

In [ ]:
sns.set(font_scale=1.0)
km = sns.clustermap(num_f.corr(), 
                    cbar=True, 
                    annot=True,
                    fmt='.2f',
                    annot_kws={'size': 10}, 
                    yticklabels=num_f.columns,
                    xticklabels=num_f.columns,
                    cmap=sns.diverging_palette(10, 220, sep=30, n=256),
                    center=0.0)
plt.show()
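[Figure: clustermap of the correlation matrix]

Some features are grouped at the first level. There are no first-level clusters with more than two members. This suggests that there is no strong multicollinearity across larger groups of features, although individual pairs such as VT and O2exp (1.00) are almost perfectly correlated.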
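
A numeric complement to the clustermap is the variance inflation factor (VIF), which for standardized features equals the corresponding diagonal element of the inverse correlation matrix. The sketch below uses synthetic columns `a`, `b`, `c` (assumptions made for illustration, not features of the dataset); as a common rule of thumb, values above roughly 5 to 10 are read as a sign of strong multicollinearity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                  # independent of both
})

# VIF of feature i is the i-th diagonal element of the inverse correlation matrix.
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)
print(vif.round(1))
```

On the project data the same computation could be run on `num_f.drop(columns='BR')`. A nearly singular correlation matrix, as a pair with correlation close to 1.00 would produce, shows up as very large VIF values.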

We compute the correlations (Pearson correlation coefficients) of the numeric features with the target feature and sort them in ascending order. This shows which features have the weakest and the strongest relationship with the target feature BR.

In [ ]:
pd.set_option('display.float_format', lambda x: '%.4f' % x)
num_f.corrwith(df[_SIHTTUNNUS_]).sort_values()
Out[ ]:
VE       -0.9704
METS     -0.9097
O2exp    -0.7376
VT       -0.7272
CO2exp   -0.6652
VO2.HR   -0.6300
EEtot    -0.5134
HR       -0.4657
Rf       -0.3460
FeCO2    -0.3060
FeO2      0.1410
BR        1.0000
dtype: float64
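
For reference, the Pearson coefficient reported by `corrwith` can be reproduced by hand as the sum of products of centered values divided by the square root of the product of their sums of squares. The sketch uses the rounded VE and BR values of the first five rows shown earlier, so its result is only illustrative and differs from the full-data value.

```python
import numpy as np
import pandas as pd

# Rounded VE and BR values from the first five rows of the dataset.
x = pd.Series([8.8, 11.4, 19.8, 18.5, 13.0])
y = pd.Series([92, 90, 84, 85, 89])

# Center both series, then apply the definition of the Pearson coefficient.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# pandas computes the same quantity (Pearson is the default method).
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```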

We display this graphically.

In [ ]:
plt.figure(dpi=130,figsize=(1,4))
sns.set(font_scale=0.8)
sns.heatmap(pd.DataFrame(num_f.corrwith(df[_SIHTTUNNUS_]).sort_values()), fmt='.2f',
            annot=True, cmap=sns.diverging_palette(10, 220, sep=30, n=256),
                 center=0.0)
plt.show()
[Figure: heatmap of the correlations between the numeric features and BR]

After cleaning and fitting the first linear regression model, the most influential features are:

  • VE Expiratory Minute Ventilation (litre/min)
  • METS Metabolic Equivalent per minute
  • O2exp Volume of O2 expired (ml/min)
  • VT Tidal Volume in litre
  • CO2exp Volume of CO2 expired (ml/min)
  • VO2.HR Oxygen pulse (ml/beat)

It turns out that deeper breathing, with a larger volume of expired oxygen, is most strongly associated with a lower breath rate. A higher metabolic equivalent also goes with a lower breath rate, which is an interesting relationship worth further study.

Relationship between the target feature and non-numeric features

We compute and display the relationships between the target feature and the non-numeric features using box plots.

In [ ]:
categ_columns=df.select_dtypes(include=object).columns
fig, axs = plt.subplots(len(categ_columns),1,dpi=95,figsize=(15,25))
i = 0
for col in categ_columns:
        df.boxplot(
            column=[_SIHTTUNNUS_],
            by=col,
            ax=axs[i],
            patch_artist = True,
            boxprops = dict(facecolor = "steelblue"),
            medianprops = dict(color = "red", linewidth = 1.5)
        )
        i+=1
plt.suptitle('')
fig.tight_layout()
plt.show()
[Figure: box plots of BR by gender and by original_activity_labels]

The dependence structure with BR is fairly uniform across categories, but the original_activity_labels values cycling and walkingFast are the most numerous and their BR values are spread over a wider range than those of the other activities.

We examine how the target feature varies across the values of the categorical features. This gives an overview of its distribution and summary statistics within each category.

In [ ]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
for col in categ_columns:
    print(df.groupby(col)[_SIHTTUNNUS_].describe())
    print()
          count  mean   std   min   25%   50%   75%   max
gender                                                   
female 13909.00 82.22 10.11 34.00 78.00 85.00 90.00 99.00
male   24786.00 81.90 10.58 23.00 77.00 85.00 90.00 99.00

                           count  mean   std   min   25%   50%   75%   max
original_activity_labels                                                  
cycling                  4212.00 64.75 11.82 23.00 57.00 64.00 73.00 99.00
dishwashing              1885.00 86.46  3.90 63.00 84.00 87.00 89.00 98.00
lyingDownLeft            1459.00 86.42  5.25 53.00 84.00 87.00 90.00 99.00
lyingDownRight           1307.00 90.19  3.40 62.00 89.00 91.00 92.00 98.00
sittingChair             1546.00 90.25  3.33 61.00 89.00 91.00 92.00 99.00
sittingCouch             1623.00 90.28  3.15 69.00 89.00 91.00 92.00 98.00
sittingSofa              1549.00 90.32  3.65 63.00 89.00 91.00 93.00 98.00
stakingShelves           1689.00 85.42  4.80 62.00 83.00 86.00 89.00 98.00
standing                 1102.00 83.45  5.76 61.00 80.00 84.00 88.00 99.00
step                      413.00 83.70  4.94 68.00 81.00 83.00 87.00 97.00
syncJumping               161.00 86.94  6.27 61.00 85.00 88.00 91.00 98.00
vacuumCleaning           1744.00 82.10  5.83 49.00 78.00 83.00 86.00 99.00
walkingFast              1997.00 72.11  8.75 45.00 67.00 72.00 77.00 99.00
walkingNormal            1883.00 76.98  7.32 43.00 72.00 77.00 82.00 98.00
walkingSlow              1694.00 82.11  5.78 62.00 78.00 82.00 86.00 98.00
walkingStairsUp           168.00 87.27  5.12 62.00 84.75 88.00 91.00 97.00

Cycling is the most frequent category, yet its mean BR (64.75) is the lowest of all activities, which is an interesting finding.

Linear regression

Model construction: model with numeric predictors (without categorical features)

Data preparation

In [ ]:
num_f=df.select_dtypes(exclude=object)
X = num_f.drop([_SIHTTUNNUS_],axis=1)
y = num_f[_SIHTTUNNUS_]

Splitting into training and test data

To check how the predictive model performs on new data, we split the dataset into training data (X_train, y_train) and test data (X_test, y_test): 20% test and 80% training data. For this we use the function train_test_split() from the model_selection module.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
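
The effect of `test_size` and `random_state` can be seen on a small synthetic array. The arrays `X_toy` and `y_toy` are made up for this sketch: with 10 rows and `test_size=0.2`, 2 rows go to the test set and 8 to the training set, and a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```

Applied to the 38695 cleaned rows, the same 80/20 rule leaves the 30956 training rows that appear in the standardization summary.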

Data standardization

Standardization reduces the effect of features being measured on different scales. It can make model fitting more efficient and reduce computation time. For example, O2exp (values up to about 969) covers a much wider range than VT (values up to about 4.6).

In [ ]:
pd.set_option('display.float_format', lambda x: '%.4f' % x)
# Fit the scaler on the training data only, then apply the same transform to the test data.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

std_df = pd.DataFrame(X_train_std, columns=X.columns)
std_df.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
EEtot 30956.0000 -0.0000 1.0000 -1.4077 -0.7433 -0.1782 0.5320 4.1906
METS 30956.0000 0.0000 1.0000 -1.5789 -0.7663 -0.2625 0.5628 6.8422
Rf 30956.0000 -0.0000 1.0000 -1.9714 -0.5403 -0.1544 0.3578 32.8557
VT 30956.0000 -0.0000 1.0000 -1.9071 -0.7096 -0.1964 0.5062 5.6983
VE 30956.0000 -0.0000 1.0000 -1.6217 -0.7266 -0.2861 0.4337 6.1408
O2exp 30956.0000 0.0000 1.0000 -1.9263 -0.7007 -0.2029 0.4874 7.3464
CO2exp 30956.0000 0.0000 1.0000 -1.6377 -0.7359 -0.2141 0.5081 5.3928
FeO2 30956.0000 -0.0000 1.0000 -6.2942 -0.5822 0.0304 0.5923 6.5488
FeCO2 30956.0000 0.0000 1.0000 -4.5911 -0.6006 -0.0221 0.6297 4.4391
VO2.HR 30956.0000 0.0000 1.0000 -1.5906 -0.6696 -0.0786 0.6387 5.4592
HR 30956.0000 -0.0000 1.0000 -2.2608 -0.3523 0.0350 0.6158 3.3541

The data are standardized: every feature now has mean 0 and standard deviation 1, so the features are on a comparable scale.
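
What the scaler does can be written out by hand: each value is shifted by the training mean and divided by the training standard deviation (computed with `ddof=0`). The arrays below are made up for illustration; the sketch only checks that the manual z-score matches `StandardScaler` when the test rows are transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features on very different scales.
X_train_toy = np.array([[0.4, 73.0], [0.5, 85.0], [0.9, 159.0], [0.6, 107.0]])
X_test_toy = np.array([[0.7, 120.0]])

scaler = StandardScaler().fit(X_train_toy)

# Manual z-score using the training mean and the population standard deviation.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
z_manual = (X_test_toy - mu) / sigma
z_sklearn = scaler.transform(X_test_toy)

print(z_manual, z_sklearn)
```

Using the training statistics for the test rows keeps information from the test set out of the fitted transform.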

Model building

We build the linear regression model with LinearRegression().

In [ ]:
slr = LinearRegression()

slr.fit(X_train_std, y_train)
Out[ ]:
LinearRegression()

Model testing

Model coefficients: each value in coef_ reflects the influence of the corresponding feature on the model's predictions.

In [ ]:
slr.coef_
Out[ ]:
array([ 0.21141921, -4.86764595, -0.16945791,  4.94370014, -6.65009999,
       -7.21399124,  3.17396404, -0.77187586, -1.11605182,  1.19794119,
       -0.43799687])

List of coefficients:

In [ ]:
for col_name, x_i in zip(X.columns, slr.coef_):
    print(col_name  + "\t", round(x_i, 4))
EEtot	 0.2114
METS	 -4.8676
Rf	 -0.1695
VT	 4.9437
VE	 -6.6501
O2exp	 -7.214
CO2exp	 3.174
FeO2	 -0.7719
FeCO2	 -1.1161
VO2.HR	 1.1979
HR	 -0.438

Model coefficients as a table:

In [ ]:
coefs = pd.DataFrame(slr.coef_, columns=["Coefficients"], index=X.columns)
coefs
Out[ ]:
Coefficients
EEtot 0.2114
METS -4.8676
Rf -0.1695
VT 4.9437
VE -6.6501
O2exp -7.2140
CO2exp 3.1740
FeO2 -0.7719
FeCO2 -1.1161
VO2.HR 1.1979
HR -0.4380
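
How the coefficients enter a prediction can be made explicit: the predicted value is the intercept plus the sum of each standardized feature value times its coefficient. The sketch fits a model on synthetic data (the array `Z`, the target `y` and the true weights are assumptions made for illustration) and checks the manual formula against `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))  # stands in for standardized features
# Synthetic target: strong negative effect of column 0, weaker effect of column 1, none of column 2.
y = 80 - 6.0 * Z[:, 0] + 2.0 * Z[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(Z, y)

# Prediction written out: intercept + features @ coefficients.
manual = model.intercept_ + Z @ model.coef_
print(model.coef_.round(2))
```

Because the features are on the same scale, the absolute size of each coefficient can be compared directly, which is what the ranking of coefficients relies on.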

The coefficients with the largest absolute values, and hence the strongest influence on the model's predictions, belong to:

  • O2exp
  • VE
  • VT
  • METS

Visualization of the coefficients:

In [ ]:
coefs.plot(kind="barh", figsize=(9, 7))
plt.title("MLR model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
[Figure: horizontal bar chart of the model coefficients, titled "MLR model"]

We check the stability of the coefficients, i.e. their variability when the model is refitted repeatedly:

In [ ]:
cv_model = cross_validate(
    slr,
    X_train_std,
    y_train,
    cv=10,
    return_estimator=True,  # keep the model fitted on each fold
    n_jobs=1
)
# One row of coefficients per cross-validation fold.
coefs = pd.DataFrame(
    [model.coef_ for model in cv_model["estimator"]],
    columns=X.columns,
)
plt.figure(figsize=(9, 7))
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.xlabel("Coefficient importance")
plt.title("Coefficient importance and its variability")
plt.subplots_adjust(left=0.3)
[Figure: horizontal box plot, "Coefficient importance and its variability"]

The coefficients are not balanced around zero. Uneven and high variability in the coefficients indicates that the model is not stable and that the coefficients are not consistent when the model is refitted on different subsets of the data.

We estimate the accuracy of the linear regression model with cross-validation, computing the mean R2 score over the cross-validation folds. We also report the scores on the training and test data separately.

In [ ]:
scores = cross_val_score(estimator=slr,
                         X=X_train_std,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV mean R2 score: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('R2 score on training data: %.3f' % slr.score(X_train_std, y_train))
print('R2 score on test data: %.3f' % slr.score(X_test_std, y_test))
CV mean R2 score: 0.970 +/- 0.001
R2 score on training data: 0.970
R2 score on test data: 0.970
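
For a regressor, both `slr.score` and the default scoring of `cross_val_score` report the coefficient of determination R2. As a sketch of the definition (the helper `r2` and the sample values are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical target values, for illustration only.
y_true = np.array([12.0, 18.0, 22.0, 30.0])
perfect = r2(y_true, y_true)                                # exact predictions
baseline = r2(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
print(perfect, baseline)
```

A score of 0.970 therefore means that the model leaves about 3% of the target's variation around its mean unexplained.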

Computing the model's RMSE:

In [ ]:
scores = cross_val_score(estimator=slr,
                         X=X_train_std,
                         y=y_train,
                         scoring = 'neg_mean_squared_error',
                         cv=10,
                         n_jobs=1)
print('CV mean RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE on training data: %.3f' % np.sqrt(mean_squared_error(y_train,slr.predict(X_train_std))))
print('RMSE on test data: %.3f' % np.sqrt(mean_squared_error(y_test,slr.predict(X_test_std))))
CV mean RMSE: 1.808 +/- 0.042
RMSE on training data: 1.807
RMSE on test data: 1.786
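
The `neg_mean_squared_error` scorer returns the MSE with its sign flipped, so each fold's value has to be negated (or passed through `np.abs`) before taking the square root. A small sketch with made-up fold scores (assumed values, not the notebook's) shows the conversion, and also that the mean of the per-fold RMSEs is not the same quantity as the root of the mean MSE:

```python
import numpy as np

# Hypothetical per-fold scores as returned with scoring='neg_mean_squared_error'.
neg_mse = np.array([-3.24, -3.61, -3.24, -2.89])

rmse_per_fold = np.sqrt(-neg_mse)          # flip the sign, then take the root
mean_of_rmse = rmse_per_fold.mean()        # what the cell above reports
rmse_of_mean = np.sqrt(np.mean(-neg_mse))  # root of the averaged MSE

print(rmse_per_fold, mean_of_rmse, rmse_of_mean)
```

The two summaries are close when the folds behave alike, which is the case here given the small fold-to-fold spread of 0.042.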

Model residuals (errors):

In [ ]:
residuals=y_train-slr.predict(X_train_std)

Standardized model residuals:

In [ ]:
std_residuals=residuals/np.std(residuals)

Model diagnostic plots:

In [ ]:
plt.style.use("seaborn-v0_8-whitegrid")  # set the style before the figure is created
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
# Residual against fitted values
axs[0, 0].scatter(x=slr.predict(X_train_std), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')

# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')

# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=slr.predict(X_train_std))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')

# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
[Figure: 2×2 diagnostic plots: Residuals vs Fitted, Normal Q-Q, Fitted vs Actual, Std. Residuals Density Plot]

Residuals vs Fitted: the points are not randomly scattered around the y=0 line. The presence of a functional pattern suggests that the model can be improved, e.g. by adding higher-degree terms.
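
To illustrate why a pattern in the residuals points to missing higher-degree terms, here is a sketch on synthetic, noise-free data (the data and the helper `rmse_of_fit` are assumptions for illustration): a straight-line fit to a curved relationship leaves residuals that follow the squared term, and adding that term removes them.

```python
import numpy as np

# Synthetic data with a curved relationship (illustrative only).
x = np.linspace(-2.0, 2.0, 41)
y = 1.0 + 0.5 * x + 0.8 * x**2

def rmse_of_fit(degree):
    """Least-squares polynomial fit of the given degree; returns training RMSE and residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return np.sqrt(np.mean(residuals**2)), residuals

rmse_lin, res_lin = rmse_of_fit(1)
rmse_quad, res_quad = rmse_of_fit(2)

# The straight-line residuals track x**2 (a systematic pattern);
# on this noise-free data the quadratic fit leaves essentially none.
pattern = np.corrcoef(res_lin, x**2)[0, 1]
print(rmse_lin, rmse_quad, pattern)
```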

The target variable Y is left-skewed. Consequently, a log transformation may be less effective or ineffective. A log transformation usually works well for right-skewed variables, because it makes the distribution more symmetric. For left-skewed variables, however, a log transformation can lose important information or make the data harder to interpret correctly.
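
A small numerical sketch of this point, using invented left-skewed values rather than the project's target (the helper `skewness` is an assumption for illustration): the logarithm compresses the large values and stretches the small ones, so the left tail becomes longer instead of shorter.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    v = np.asarray(values, dtype=float)
    d = v - v.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

# Hypothetical left-skewed, strictly positive target values (illustrative only).
y = np.array([29.5, 29.0, 29.0, 28.5, 28.0, 28.0, 27.0, 26.0, 24.0, 20.0, 15.0])

skew_raw = skewness(y)
skew_log = skewness(np.log(y))
print(skew_raw, skew_log)  # both negative; the log makes the left tail longer
```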

Normal Q-Q (quantile-quantile) plot: the closer the quantiles of the standardized residuals are to the quantiles of the standard normal distribution, the better.
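
The construction behind a Q-Q plot can be sketched with the standard library alone. The function name `qq_pairs`, the plotting position `(i + 0.5) / n` and the sample residuals are assumptions for illustration; `stats.probplot` uses a slightly different plotting position.

```python
from statistics import NormalDist

def qq_pairs(std_residuals):
    """Pair each sorted residual with the standard-normal quantile of its plotting position."""
    ordered = sorted(std_residuals)
    n = len(ordered)
    normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical standardized residuals (illustrative only).
sample = [-1.2, 0.1, 0.4, -0.3, 2.5, -0.6, 0.9]
pairs = qq_pairs(sample)
for theo, obs in pairs:
    print(f"{theo:+.3f}  {obs:+.3f}")
```

Points whose second value departs strongly from the first, such as the largest residual in this sample, are the ones that bend away from the reference line in the plot.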

Fitted vs Actual: the model fits the training data well enough, since the points lie close to the red y=x line.

The model predicts observations with different BR values evenly.

Model with categorical features¶

We encode the categorical features so that they can be used in the model. We build a new data structure in which each categorical feature is replaced by several new features, one per category, each taking the value 1 (True) when the observation belongs to that category. The first category of each feature is left out to avoid linear dependence among the dummy variables. The resulting structure thus contains dummy variables in place of the categorical features.

In [ ]:
X_dummy = pd.get_dummies(data=df.drop([_SIHTTUNNUS_],axis=1), drop_first=True)
X_dummy.head()
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR ... original_activity_labels_sittingSofa original_activity_labels_stakingShelves original_activity_labels_standing original_activity_labels_step original_activity_labels_syncJumping original_activity_labels_vacuumCleaning original_activity_labels_walkingFast original_activity_labels_walkingNormal original_activity_labels_walkingSlow original_activity_labels_walkingStairsUp
0 0.0000 0.9528 21.5054 0.4091 8.7978 73.1476 12.4463 17.8803 3.0424 2.1250 ... False False False False False False False False False False
1 0.0097 1.0143 24.5902 0.4642 11.4144 85.4919 11.8342 18.4176 2.5495 2.2403 ... False False False False False False False False False False
2 0.0940 1.9223 22.5564 0.8774 19.7902 159.3537 25.3007 18.1628 2.8837 4.2051 ... False False False False False False False False False False
3 0.2080 2.0189 20.0000 0.9243 18.4858 164.5567 30.6057 17.8035 3.3113 4.3329 ... False False False False False False False False False False
4 0.2703 1.2154 22.0588 0.5876 12.9624 107.3240 16.2210 18.2639 2.7604 2.6086 ... False False False False False False False False False False

5 rows × 27 columns

In [ ]:
X_dummy
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR ... original_activity_labels_sittingSofa original_activity_labels_stakingShelves original_activity_labels_standing original_activity_labels_step original_activity_labels_syncJumping original_activity_labels_vacuumCleaning original_activity_labels_walkingFast original_activity_labels_walkingNormal original_activity_labels_walkingSlow original_activity_labels_walkingStairsUp
0 0.0000 0.9528 21.5054 0.4091 8.7978 73.1476 12.4463 17.8803 3.0424 2.1250 ... False False False False False False False False False False
1 0.0097 1.0143 24.5902 0.4642 11.4144 85.4919 11.8342 18.4176 2.5495 2.2403 ... False False False False False False False False False False
2 0.0940 1.9223 22.5564 0.8774 19.7902 159.3537 25.3007 18.1628 2.8837 4.2051 ... False False False False False False False False False False
3 0.2080 2.0189 20.0000 0.9243 18.4858 164.5567 30.6057 17.8035 3.3113 4.3329 ... False False False False False False False False False False
4 0.2703 1.2154 22.0588 0.5876 12.9624 107.3240 16.2210 18.2639 2.7604 2.6086 ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38718 85.4218 1.2031 21.8182 0.5650 12.3281 102.0947 13.8710 18.0686 2.4549 3.9375 ... False False False False False False False False False False
38719 85.4871 0.9703 19.8675 0.5314 10.5573 96.9951 11.8312 18.2534 2.2265 3.1757 ... False False False False False False False False False False
38720 85.5828 2.2391 26.7857 0.6436 17.2386 110.4581 19.6846 17.1632 3.0586 7.3279 ... False False False False False False False False False False
38721 85.7260 1.3556 14.9254 0.7017 10.4733 120.3514 22.2344 17.1512 3.1686 4.4365 ... False False False False False False False False False False
38722 86.1708 1.4569 24.7934 0.6028 14.9449 109.3362 12.8511 18.1387 2.1320 4.7681 ... False False False False False False False False False False

38695 rows × 27 columns
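
To make the effect of `drop_first=True` concrete, here is a sketch on a tiny invented frame (the column names mirror the notebook, the rows are made up): the alphabetically first category of each categorical column gets no column of its own and becomes the reference level.

```python
import pandas as pd

# Hypothetical miniature frame; values are invented for illustration.
toy = pd.DataFrame({
    "HR": [64, 135, 68],
    "gender": ["female", "male", "female"],
    "original_activity_labels": ["standing", "walkingFast", "dishwashing"],
})

# Numeric columns pass through; each text column becomes k - 1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

In this toy frame `gender_female` and `original_activity_labels_dishwashing` are the dropped reference levels; a row with all indicators of a feature set to False belongs to that reference level.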

Splitting into training and test data

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.2, random_state=0)

Standardizing the data

We standardize the numerical predictors and then combine them with the remaining predictors. First we extract the numerical features:

In [ ]:
X_train_num = X_train[num_f.drop([_SIHTTUNNUS_],axis=1).columns]
X_train_num.head()
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR HR
7985 120.9925 1.0065 9.9834 1.2768 12.7465 232.7153 31.6134 18.2268 2.4760 4.6235 64
16281 144.2626 3.4369 31.4136 0.9985 31.3650 174.7038 35.1855 17.4974 3.5240 6.5937 135
9594 103.5986 1.5816 18.2927 0.9607 17.5744 167.2611 29.5767 17.4098 3.0786 7.8152 68
9242 51.8238 2.3240 26.6667 0.9985 26.6258 174.7062 30.6985 17.4974 3.0746 11.8315 66
17936 129.4891 7.6673 32.2581 1.6683 53.8148 286.4388 65.4659 17.1699 3.9242 0.0000 0
In [ ]:
X_test_num = X_test[num_f.drop([_SIHTTUNNUS_],axis=1).columns]
X_test_num.head()
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR HR
14331 60.2944 1.5900 22.3048 0.6506 14.5125 113.0974 16.8269 17.3824 2.5862 6.3275 73
3717 118.9672 3.1380 28.3019 1.1517 32.5956 197.4944 43.9670 17.1479 3.8175 10.5206 95
6326 46.2270 1.5039 22.1402 0.7110 15.7408 126.2795 19.9925 17.7618 2.8121 5.6145 75
33591 99.7225 1.6865 18.8679 0.8608 16.2412 153.3907 22.7434 17.8199 2.6422 5.6727 77
9741 129.1196 1.5356 20.6897 0.9097 18.8221 161.0398 27.0269 17.7018 2.9709 7.2668 71

Standardization:

In [ ]:
sc.fit(X_train_num)
X_train_std = sc.transform(X_train_num)
X_test_std = sc.transform(X_test_num)

We combine the predictors:

In [ ]:
X_train_num.columns
Out[ ]:
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
       'VO2.HR', 'HR'],
      dtype='object')
In [ ]:
X_train.columns
Out[ ]:
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
       'VO2.HR', 'HR', 'gender_male', 'original_activity_labels_dishwashing',
       'original_activity_labels_lyingDownLeft',
       'original_activity_labels_lyingDownRight',
       'original_activity_labels_sittingChair',
       'original_activity_labels_sittingCouch',
       'original_activity_labels_sittingSofa',
       'original_activity_labels_stakingShelves',
       'original_activity_labels_standing', 'original_activity_labels_step',
       'original_activity_labels_syncJumping',
       'original_activity_labels_vacuumCleaning',
       'original_activity_labels_walkingFast',
       'original_activity_labels_walkingNormal',
       'original_activity_labels_walkingSlow',
       'original_activity_labels_walkingStairsUp'],
      dtype='object')
In [ ]:
dummy_col=X_train.columns[~X_train.columns.isin(X_train_num.columns)]
dummy_col
Out[ ]:
Index(['gender_male', 'original_activity_labels_dishwashing',
       'original_activity_labels_lyingDownLeft',
       'original_activity_labels_lyingDownRight',
       'original_activity_labels_sittingChair',
       'original_activity_labels_sittingCouch',
       'original_activity_labels_sittingSofa',
       'original_activity_labels_stakingShelves',
       'original_activity_labels_standing', 'original_activity_labels_step',
       'original_activity_labels_syncJumping',
       'original_activity_labels_vacuumCleaning',
       'original_activity_labels_walkingFast',
       'original_activity_labels_walkingNormal',
       'original_activity_labels_walkingSlow',
       'original_activity_labels_walkingStairsUp'],
      dtype='object')
In [ ]:
X_train_std=pd.DataFrame(X_train_std, columns=X_train_num.columns).join(X_train[dummy_col].reset_index()).drop(['index'],axis=1)
X_train_std.head()
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR ... original_activity_labels_sittingSofa original_activity_labels_stakingShelves original_activity_labels_standing original_activity_labels_step original_activity_labels_syncJumping original_activity_labels_vacuumCleaning original_activity_labels_walkingFast original_activity_labels_walkingNormal original_activity_labels_walkingSlow original_activity_labels_walkingStairsUp
0 0.9165 -1.0224 -1.3076 0.1345 -0.8833 0.2415 -0.3129 0.9779 -0.9337 -0.6636 ... False False False False False True False False False False
1 1.3635 0.3213 0.6981 -0.3264 0.2122 -0.3180 -0.1632 0.0106 0.6142 -0.2686 ... False False False False False False False False False False
2 0.5823 -0.7044 -0.5299 -0.3889 -0.5992 -0.3898 -0.3982 -0.1057 -0.0438 -0.0237 ... False False False False False False False False False False
3 -0.4122 -0.2940 0.2538 -0.3264 -0.0667 -0.3180 -0.3512 0.0106 -0.0497 0.7816 ... False False False False False False False False False False
4 1.0797 2.6602 0.7772 0.7827 1.5331 0.7596 1.1058 -0.4238 1.2053 -1.5906 ... False False False False False False False False False False

5 rows × 27 columns

In [ ]:
X_test_std=pd.DataFrame(X_test_std, columns=X_test_num.columns).join(X_test[dummy_col].reset_index()).drop(['index'],axis=1)
X_test_std.head()
Out[ ]:
EEtot METS Rf VT VE O2exp CO2exp FeO2 FeCO2 VO2.HR ... original_activity_labels_sittingSofa original_activity_labels_stakingShelves original_activity_labels_standing original_activity_labels_step original_activity_labels_syncJumping original_activity_labels_vacuumCleaning original_activity_labels_walkingFast original_activity_labels_walkingNormal original_activity_labels_walkingSlow original_activity_labels_walkingStairsUp
0 -0.2495 -0.6998 -0.1544 -0.9023 -0.7794 -0.9122 -0.9325 -0.1420 -0.7710 -0.3220 ... False False False False False False False False False False
1 0.8776 0.1560 0.4069 -0.0726 0.2846 -0.0982 0.2048 -0.4530 1.0478 0.5188 ... False False False False False False False False False False
2 -0.5197 -0.7474 -0.1698 -0.8025 -0.7071 -0.7851 -0.7999 0.3612 -0.4374 -0.4649 ... False False False False False False False False False False
3 0.5079 -0.6465 -0.4761 -0.5544 -0.6777 -0.5236 -0.6846 0.4382 -0.6883 -0.4532 ... False False False False False False False False False False
4 1.0726 -0.7299 -0.3056 -0.4733 -0.5258 -0.4498 -0.5051 0.2816 -0.2029 -0.1336 ... False False False False False True False False False False

5 rows × 27 columns
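
The `reset_index()` call in the two joins above matters: after `train_test_split` the dummy columns still carry the original, shuffled row labels, while the frame built from the scaled NumPy array gets a fresh 0..n-1 index. A sketch with invented values (the frames `scaled` and `dummies` are assumptions for illustration) shows what happens without the reset:

```python
import numpy as np
import pandas as pd

# Hypothetical shuffled split: the original row labels survive train_test_split.
dummies = pd.DataFrame({"gender_male": [True, False, True]}, index=[7985, 16281, 9594])
scaled = pd.DataFrame(np.array([[0.9], [1.4], [0.6]]), columns=["EEtot"])  # fresh 0..n-1 index

misaligned = scaled.join(dummies)                      # labels do not match, so NaN is filled in
aligned = scaled.join(dummies.reset_index(drop=True))  # rows are matched by position after the reset

print(misaligned["gender_male"].isna().all())
print(aligned["gender_male"].tolist())
```

Passing `drop=True` is a shorter equivalent of `reset_index()` followed by dropping the `index` column.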

In [ ]:
X_train_std.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
EEtot 30956.0000 -0.0000 1.0000 -1.4077 -0.7433 -0.1782 0.5320 4.1906
METS 30956.0000 0.0000 1.0000 -1.5789 -0.7663 -0.2625 0.5628 6.8422
Rf 30956.0000 -0.0000 1.0000 -1.9714 -0.5403 -0.1544 0.3578 32.8557
VT 30956.0000 -0.0000 1.0000 -1.9071 -0.7096 -0.1964 0.5062 5.6983
VE 30956.0000 -0.0000 1.0000 -1.6217 -0.7266 -0.2861 0.4337 6.1408
O2exp 30956.0000 0.0000 1.0000 -1.9263 -0.7007 -0.2029 0.4874 7.3464
CO2exp 30956.0000 0.0000 1.0000 -1.6377 -0.7359 -0.2141 0.5081 5.3928
FeO2 30956.0000 -0.0000 1.0000 -6.2942 -0.5822 0.0304 0.5923 6.5488
FeCO2 30956.0000 0.0000 1.0000 -4.5911 -0.6006 -0.0221 0.6297 4.4391
VO2.HR 30956.0000 0.0000 1.0000 -1.5906 -0.6696 -0.0786 0.6387 5.4592
HR 30956.0000 -0.0000 1.0000 -2.2608 -0.3523 0.0350 0.6158 3.3541

Building the model

In [ ]:
slr = LinearRegression()

slr.fit(X_train_std, y_train)
Out[ ]:
LinearRegression()

Testing the model

In [ ]:
for col_name, x_i in zip(X_train_std.columns, slr.coef_):
    print(col_name  + "\t", round(x_i, 4))
EEtot	 0.3412
METS	 -3.598
Rf	 -0.1433
VT	 -1.9545
VE	 -7.7387
O2exp	 -1.0557
CO2exp	 3.7042
FeO2	 -0.9021
FeCO2	 -1.1345
VO2.HR	 1.0252
HR	 -0.431
gender_male	 1.6887
original_activity_labels_dishwashing	 -0.031
original_activity_labels_lyingDownLeft	 -0.0819
original_activity_labels_lyingDownRight	 0.1211
original_activity_labels_sittingChair	 0.1296
original_activity_labels_sittingCouch	 0.2812
original_activity_labels_sittingSofa	 0.3217
original_activity_labels_stakingShelves	 -0.0936
original_activity_labels_standing	 0.679
original_activity_labels_step	 0.5068
original_activity_labels_syncJumping	 0.8992
original_activity_labels_vacuumCleaning	 -0.2189
original_activity_labels_walkingFast	 0.1547
original_activity_labels_walkingNormal	 0.4252
original_activity_labels_walkingSlow	 0.6267
original_activity_labels_walkingStairsUp	 0.1982
In [ ]:
coefs = pd.DataFrame(
    slr.coef_, columns=["Coefficients"], index=X_train_std.columns)
coefs
Out[ ]:
Coefficients
EEtot 0.3412
METS -3.5980
Rf -0.1433
VT -1.9545
VE -7.7387
O2exp -1.0557
CO2exp 3.7042
FeO2 -0.9021
FeCO2 -1.1345
VO2.HR 1.0252
HR -0.4310
gender_male 1.6887
original_activity_labels_dishwashing -0.0310
original_activity_labels_lyingDownLeft -0.0819
original_activity_labels_lyingDownRight 0.1211
original_activity_labels_sittingChair 0.1296
original_activity_labels_sittingCouch 0.2812
original_activity_labels_sittingSofa 0.3217
original_activity_labels_stakingShelves -0.0936
original_activity_labels_standing 0.6790
original_activity_labels_step 0.5068
original_activity_labels_syncJumping 0.8992
original_activity_labels_vacuumCleaning -0.2189
original_activity_labels_walkingFast 0.1547
original_activity_labels_walkingNormal 0.4252
original_activity_labels_walkingSlow 0.6267
original_activity_labels_walkingStairsUp 0.1982
In [ ]:
coefs.plot(kind="barh", figsize=(9, 7))
plt.title("MLR model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
[Figure: horizontal bar chart of the coefficients, titled "MLR model"]

The coefficients with the largest effect are:

  • VE
  • CO2exp
  • METS

We use cross-validation to estimate the model's accuracy, averaging the score over the 10 cross-validation folds.

In [ ]:
scores = cross_val_score(estimator=slr,
                         X=X_train_std,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV mean R2 score: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('R2 score on training data: %.3f' % slr.score(X_train_std, y_train))
print('R2 score on test data: %.3f' % slr.score(X_test_std, y_test))
CV mean R2 score: 0.974 +/- 0.001
R2 score on training data: 0.974
R2 score on test data: 0.974
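
With `cv=10` and a regressor, scikit-learn splits the training data into 10 consecutive, unshuffled folds; each fold serves once as the validation set while the other nine are used for fitting. A plain-Python sketch of that splitting (the function `kfold_indices` is an illustrative assumption, not a scikit-learn API):

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=23, n_splits=10))
print([len(test) for _, test in folds])  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```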

We compute the RMSE of the linear regression model: mean_squared_error returns the MSE, and taking its square root gives the RMSE. The training-data value shows how well the model reproduces the data it was fitted on; the cross-validation and test values show how well it predicts unseen data.

In [ ]:
scores = cross_val_score(estimator=slr,
                         X=X_train_std,
                         y=y_train,
                         scoring = 'neg_mean_squared_error',
                         cv=10,
                         n_jobs=1)
print('CV mean RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE on training data: %.3f' % np.sqrt(mean_squared_error(y_train,slr.predict(X_train_std))))
print('RMSE on test data: %.3f' % np.sqrt(mean_squared_error(y_test,slr.predict(X_test_std))))
CV mean RMSE: 1.673 +/- 0.035
RMSE on training data: 1.671
RMSE on test data: 1.653
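
To put these numbers next to those of the numeric-only model, the relative RMSE reduction can be computed directly from the two printed outputs (the dictionary names are illustrative; the values are copied from the outputs above):

```python
# Values copied from the outputs above.
rmse_numeric_only = {"cv": 1.808, "train": 1.807, "test": 1.786}
rmse_with_dummies = {"cv": 1.673, "train": 1.671, "test": 1.653}

# Relative RMSE reduction from adding the dummy variables, per data split.
reduction = {
    key: (rmse_numeric_only[key] - rmse_with_dummies[key]) / rmse_numeric_only[key]
    for key in rmse_numeric_only
}
for key, value in reduction.items():
    print(f"{key}: {value:.1%}")
```

On all three splits the reduction is roughly 7.5%, noticeably larger than the fold-to-fold standard deviation of the cross-validated RMSE.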

Model residuals (errors):

In [ ]:
residuals=y_train-slr.predict(X_train_std)

Standardized model residuals:

In [ ]:
std_residuals=residuals/np.std(residuals)

Model diagnostic plots:

In [ ]:
plt.style.use("seaborn-v0_8-whitegrid")  # set the style before the figure is created
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
# Residual against fitted values
axs[0, 0].scatter(x=slr.predict(X_train_std), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')

# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')

# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=slr.predict(X_train_std))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')

# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
[Figure: 2×2 diagnostic plots: Residuals vs Fitted, Normal Q-Q, Fitted vs Actual, Std. Residuals Density Plot]

The diagnostic plots show results similar to those of the linear regression model fitted on the numerical features only. However, the Residuals vs Fitted plot shows more scatter in the residuals. This model is more reliable.

Polynomial regression¶

Preparing the data¶

Splitting into training and test data

In [ ]:
X = df.drop([_SIHTTUNNUS_],axis=1)
y = df[_SIHTTUNNUS_]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We separate the categorical and numerical variables, identifying them by their data types. As we saw earlier, the object dtype corresponds to the categorical columns. We use make_column_selector to select the corresponding columns.

In [ ]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)

numerical_columns
Out[ ]:
['EEtot',
 'METS',
 'Rf',
 'VT',
 'VE',
 'O2exp',
 'CO2exp',
 'FeO2',
 'FeCO2',
 'VO2.HR',
 'HR']
In [ ]:
categorical_columns
Out[ ]:
['gender', 'original_activity_labels']
In [ ]:
categorical_preprocessor = OneHotEncoder(drop='first')

The preprocessor for the numerical features must also generate the polynomial terms, so we use a pipeline:

In [ ]:
# Scale first, then expand to degree-2 polynomial terms.
numerical_preprocessor = Pipeline([
    ('scaler', StandardScaler()),
    ('poly2', PolynomialFeatures(degree=2))
])
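
With its default settings, `PolynomialFeatures(degree=2)` produces a constant column, every original feature, and every product of two features, squares included. As a sketch of how many columns that gives for the 11 numerical features (plain Python; the list `terms` is an illustrative assumption and does not call scikit-learn):

```python
from itertools import combinations_with_replacement
from math import comb

numerical_columns = ["EEtot", "METS", "Rf", "VT", "VE", "O2exp",
                     "CO2exp", "FeO2", "FeCO2", "VO2.HR", "HR"]

# Degree-2 expansion: the constant, every single feature,
# and every product of two features (squares included).
terms = [()]
for degree in (1, 2):
    terms += list(combinations_with_replacement(numerical_columns, degree))

n_terms = len(terms)
print(n_terms, comb(len(numerical_columns) + 2, 2))  # 78 78
```

The constant column duplicates the intercept that `LinearRegression` fits by default, which is one reason the coefficients of such a model can become very large.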

Now we create a ColumnTransformer and associate the preprocessors with the corresponding columns:

In [ ]:
preprocessor = ColumnTransformer(
    [
        ("ctg", categorical_preprocessor, categorical_columns),
        ("num", numerical_preprocessor, numerical_columns),
    ]
)

Now we create a pipeline that chains the ColumnTransformer with the model:

In [ ]:
poly_lr = Pipeline([
    ('pre', preprocessor),
    ('lr', LinearRegression())
])

Fitting the model on the training data¶

In [ ]:
poly_lr.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('pre',
                 ColumnTransformer(transformers=[('ctg',
                                                  OneHotEncoder(drop='first'),
                                                  ['gender',
                                                   'original_activity_labels']),
                                                 ('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler()),
                                                                  ('poly2',
                                                                   PolynomialFeatures())]),
                                                  ['EEtot', 'METS', 'Rf', 'VT',
                                                   'VE', 'O2exp', 'CO2exp',
                                                   'FeO2', 'FeCO2', 'VO2.HR',
                                                   'HR'])])),
                ('lr', LinearRegression())])

Testing the model

In [ ]:
preprocessor.get_feature_names_out()
Out[ ]:
array(['ctg__gender_male', 'ctg__original_activity_labels_dishwashing',
       'ctg__original_activity_labels_lyingDownLeft',
       'ctg__original_activity_labels_lyingDownRight',
       'ctg__original_activity_labels_sittingChair',
       'ctg__original_activity_labels_sittingCouch',
       'ctg__original_activity_labels_sittingSofa',
       'ctg__original_activity_labels_stakingShelves',
       'ctg__original_activity_labels_standing',
       'ctg__original_activity_labels_step',
       'ctg__original_activity_labels_syncJumping',
       'ctg__original_activity_labels_vacuumCleaning',
       'ctg__original_activity_labels_walkingFast',
       'ctg__original_activity_labels_walkingNormal',
       'ctg__original_activity_labels_walkingSlow',
       'ctg__original_activity_labels_walkingStairsUp',
       'ctg__original_activity_labels_nan', 'num__1', 'num__EEtot',
       'num__METS', 'num__Rf', 'num__VT', 'num__VE', 'num__O2exp',
       'num__CO2exp', 'num__FeO2', 'num__FeCO2', 'num__VO2.HR', 'num__HR',
       'num__EEtot^2', 'num__EEtot METS', 'num__EEtot Rf',
       'num__EEtot VT', 'num__EEtot VE', 'num__EEtot O2exp',
       'num__EEtot CO2exp', 'num__EEtot FeO2', 'num__EEtot FeCO2',
       'num__EEtot VO2.HR', 'num__EEtot HR', 'num__METS^2',
       'num__METS Rf', 'num__METS VT', 'num__METS VE', 'num__METS O2exp',
       'num__METS CO2exp', 'num__METS FeO2', 'num__METS FeCO2',
       'num__METS VO2.HR', 'num__METS HR', 'num__Rf^2', 'num__Rf VT',
       'num__Rf VE', 'num__Rf O2exp', 'num__Rf CO2exp', 'num__Rf FeO2',
       'num__Rf FeCO2', 'num__Rf VO2.HR', 'num__Rf HR', 'num__VT^2',
       'num__VT VE', 'num__VT O2exp', 'num__VT CO2exp', 'num__VT FeO2',
       'num__VT FeCO2', 'num__VT VO2.HR', 'num__VT HR', 'num__VE^2',
       'num__VE O2exp', 'num__VE CO2exp', 'num__VE FeO2', 'num__VE FeCO2',
       'num__VE VO2.HR', 'num__VE HR', 'num__O2exp^2',
       'num__O2exp CO2exp', 'num__O2exp FeO2', 'num__O2exp FeCO2',
       'num__O2exp VO2.HR', 'num__O2exp HR', 'num__CO2exp^2',
       'num__CO2exp FeO2', 'num__CO2exp FeCO2', 'num__CO2exp VO2.HR',
       'num__CO2exp HR', 'num__FeO2^2', 'num__FeO2 FeCO2',
       'num__FeO2 VO2.HR', 'num__FeO2 HR', 'num__FeCO2^2',
       'num__FeCO2 VO2.HR', 'num__FeCO2 HR', 'num__VO2.HR^2',
       'num__VO2.HR HR', 'num__HR^2'], dtype=object)
In [ ]:
poly_lr.named_steps['lr'].coef_
Out[ ]:
array([ 1.11588524e+00, -3.35223584e-02, -6.45980021e-02, -1.76905759e-01,
       -8.69381913e-02, -4.17115094e-02, -1.53874216e-01,  1.12142675e-01,
        2.75179062e-01, -1.12167018e-01,  2.49311864e-01, -1.39661167e-03,
       -6.66572688e-02, -2.50763177e-02,  1.42887814e-01, -6.78430960e-02,
       -7.31760935e-02, -2.53504494e+03,  6.31088389e-02, -6.10776529e+00,
        2.70223687e+08, -4.05085758e+08, -1.74074102e+08,  3.44535832e+08,
        2.57149489e+08, -5.28415641e+07, -6.19166777e+07,  7.29776039e+00,
        4.56303008e+00,  1.21282010e-02, -3.55187891e-01, -2.90894779e-02,
       -1.11653483e+00,  5.71564049e-01,  7.31470227e-01,  3.85120749e-01,
       -6.55072629e-02, -3.65646631e-02, -1.36274666e-01, -9.61393714e-02,
       -7.29222342e-01,  1.59336299e-01,  2.72875926e+01,  2.32483765e+00,
       -1.94251935e+01, -6.73374751e+00, -1.92198621e+00, -2.62710661e-01,
       -2.22766016e+00, -1.88767031e-01,  3.10957432e-04,  4.44047038e+06,
        1.36789352e-01, -1.12966873e+09,  1.54127620e+09, -3.34736034e-02,
        5.02254311e-02, -3.12185787e-01, -8.24527442e-02, -3.89861885e+01,
       -3.69538771e+01,  5.26733545e+01,  3.23250163e+01,  3.13555431e+07,
       -2.48821917e+08, -5.35449639e-01, -7.76449083e+00, -1.18688235e+00,
        2.89877477e+01,  7.32735864e+00,  1.30676841e+08, -6.95598797e+08,
        1.13887857e+00, -1.41304341e+00, -1.65728529e+01, -2.49536350e+01,
       -1.22697956e+01,  7.82218382e+08, -3.96077499e-01,  5.49233616e+00,
       -4.65898836e+00, -2.00492064e+08, -2.04347900e-01,  5.53791635e-01,
        2.22513276e+00,  2.94494927e-01,  6.73780438e-01,  6.71497986e-01,
        2.81234711e-01,  2.72435844e-01,  3.55855571e-01, -2.13263795e-01,
        8.28346810e-01,  5.19687173e+00,  4.30181950e-01])
In [ ]:
coefs = pd.DataFrame(
    poly_lr.named_steps['lr'].coef_, columns=["Coefficients"], index=preprocessor.get_feature_names_out())
coefs
Out[ ]:
Coefficients
ctg__gender_male 1.1159
ctg__original_activity_labels_dishwashing -0.0335
ctg__original_activity_labels_lyingDownLeft -0.0646
ctg__original_activity_labels_lyingDownRight -0.1769
ctg__original_activity_labels_sittingChair -0.0869
... ...
num__FeCO2 VO2.HR 0.3559
num__FeCO2 HR -0.2133
num__VO2.HR^2 0.8283
num__VO2.HR HR 5.1969
num__HR^2 0.4302

95 rows × 1 columns

In [ ]:
coefs.plot(kind="barh", figsize=(9, 12))
plt.title("Poly model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
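[Figure: horizontal bar chart "Poly model" of the polynomial regression coefficients]

In this model the categorical features have lost importance compared with the linear regression model with categorical features. Several numerical coefficients reach magnitudes of 1e8 to 1e9, which usually indicates strongly collinear polynomial terms, so individual coefficients should be interpreted with caution.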

In [ ]:
scores = cross_val_score(estimator=poly_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV mean R2: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('R2 on training data: %.3f' % poly_lr.score(X_train, y_train))
print('R2 on test data: %.3f' % poly_lr.score(X_test, y_test))
CV mean R2: 0.984 +/- 0.001
R2 on training data: 0.985
R2 on test data: 0.984

Computing the model's RMSE:

In [ ]:
scores = cross_val_score(estimator=poly_lr,
                         X=X_train,
                         y=y_train,
                         scoring = 'neg_mean_squared_error',
                         cv=10,
                         n_jobs=1)
print('CV mean RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE on training data: %.3f' % np.sqrt(mean_squared_error(y_train,poly_lr.predict(X_train))))
print('RMSE on test data: %.3f' % np.sqrt(mean_squared_error(y_test,poly_lr.predict(X_test))))
CV mean RMSE: 1.309 +/- 0.029
RMSE on training data: 1.300
RMSE on test data: 1.290
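
The cell above derives RMSE from the `neg_mean_squared_error` scorer. As a hedged alternative, scikit-learn also ships a `neg_root_mean_squared_error` scorer that returns the (negated) RMSE per fold directly. The sketch below uses synthetic data from `make_regression` and a plain `LinearRegression`; `X_demo`, `y_demo` and `rmse_per_fold` are illustrative names, not part of the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); the project's X_train / y_train are not used here.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=2.0, random_state=0)

# The scorer returns the negative RMSE of each fold, so the sign is flipped.
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=10,
)
rmse_per_fold = -neg_rmse

print("CV mean RMSE: %.3f +/- %.3f" % (rmse_per_fold.mean(), rmse_per_fold.std()))
```

This avoids the `np.sqrt(np.abs(scores))` step and makes it harder to forget the sign convention.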

Model residuals (errors):

In [ ]:
residuals=y_train-poly_lr.predict(X_train)

Standardized residuals of the model:

In [ ]:
std_residuals=residuals/np.std(residuals)

Model diagnostic plots:

In [ ]:
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
# plt.style.use('seaborn')
# Residual against fitted values
axs[0, 0].scatter(x=poly_lr.predict(X_train), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')

# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')

# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=poly_lr.predict(X_train))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')

# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
[Figure: 2×2 diagnostic plots: Residuals vs Fitted, Normal Q-Q, Fitted vs Actual, Std. Residuals Density Plot]

The residual diagnostics show better results on the Residuals vs Fitted plot than for the linear regression model. It appears to be the best model so far.

Validation curve for the polynomial regression degree¶

In [ ]:
degrees = range(1,4)
a = []
c = []
for deg in degrees:
  numerical_preprocessor = Pipeline([
    ('scaler', StandardScaler()),
    ('poly2', PolynomialFeatures(degree=deg))])
  preprocessor = ColumnTransformer(
    [
        ("ctg", categorical_preprocessor, categorical_columns),
        ("num", numerical_preprocessor, numerical_columns),
    ])
  poly_lr = Pipeline([
    ('pre', preprocessor),
    ('lr', LinearRegression())])
  poly_lr.fit(X_train, y_train)
  cv_models = cross_validate(estimator=poly_lr,
                        X=X_train,
                        y=y_train,
                        return_estimator=True,
                        cv=10,
                        n_jobs=1)
  cv_fit = cv_models['estimator']
  c.append(r2_score(np.exp(y_train),np.exp(poly_lr.predict(X_train))))
  b = []
  for i in range(len(cv_fit)):
    b.append(r2_score(np.exp(y_test),np.exp(cv_fit[i].predict(X_test))))
  a.append(np.mean(b))
plt.figure(figsize=(6, 4))

plt.plot(degrees, a, lw=2,
         label='cross-validation test')
plt.plot(degrees, c, lw=2, label='train')

plt.legend(loc='best')
plt.xlabel('degree')
plt.ylabel('R2')
plt.title('Validation curve')
plt.tight_layout()
[Figure: "Validation curve", R2 against polynomial degree for the train and cross-validation test curves]

The validation curves for the polynomial degree (cross-validation test data and training data) are not parallel. Both have a kink in the middle, at different angles. The cross-validation test curve (blue) falls from an R2 of about zero to about -6 and then continues as a horizontal line. The train curve also falls, to an R2 of about -8, and then turns sharply upwards, by roughly 90 degrees relative to its previous slope. Note that these R2 values are computed after applying np.exp() to both the targets and the predictions, so they are not comparable with the R2 values reported above.

This behaviour may call for further investigation and modelling in order to improve the model's predictive power and avoid overfitting. The model may be too complex, which in turn can cause overfitting and poor generalization to new data. Possible remedies are simplifying the model, regularization (e.g. Ridge or Lasso), or considering other, simpler models.
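
To illustrate the regularization remedy, here is a minimal sketch on synthetic data (not the project's dataset) that swaps `LinearRegression` for `Ridge` inside a scaler + polynomial pipeline. The three strongly correlated columns, the helper `make_poly_model`, and `alpha=1.0` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
# Three strongly correlated columns (assumption: a stand-in for correlated measurements).
X_demo = np.hstack([base + 0.01 * rng.normal(size=(300, 1)) for _ in range(3)])
y_demo = 2.0 * base[:, 0] + 0.5 * base[:, 0] ** 2 + rng.normal(scale=0.1, size=300)


def make_poly_model(regressor):
    """Hypothetical helper: scaler -> degree-2 polynomial terms -> given regressor."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("reg", regressor),
    ])


ols = make_poly_model(LinearRegression()).fit(X_demo, y_demo)
ridge = make_poly_model(Ridge(alpha=1.0)).fit(X_demo, y_demo)

# The L2 penalty shrinks the coefficient vector at a small cost in training fit.
ols_norm = np.linalg.norm(ols.named_steps["reg"].coef_)
ridge_norm = np.linalg.norm(ridge.named_steps["reg"].coef_)
ols_r2 = ols.score(X_demo, y_demo)
ridge_r2 = ridge.score(X_demo, y_demo)
print(f"coefficient norm  OLS: {ols_norm:.3f}  Ridge: {ridge_norm:.3f}")
print(f"train R2          OLS: {ols_r2:.3f}  Ridge: {ridge_r2:.3f}")
```

In the project pipeline the same swap would presumably mean replacing the `'lr'` step; a suitable `alpha` would have to be found by cross-validation.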

Each polynomial degree corresponds to a point on the curve, which shows how model performance (measured by the metric on the Y axis) changes as the degree increases. Ideally, one looks for the degree that gives the best performance, i.e. where the Y-axis value is highest.

If the curves have kinks or peaks, model performance changes sharply as the degree increases. Such points can matter when searching for the optimal model complexity.
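
A validation curve over the polynomial degree can also be computed with scikit-learn's `validation_curve`, which refits the pipeline for each degree and scores every fold on its own held-out part. The sketch below is an illustration on synthetic data with a quadratic ground truth; the step name `poly` and the variables `X_demo`, `y_demo` are assumptions, and no `np.exp()` back-transformation is applied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
# Quadratic ground truth plus noise (assumption, for illustration only).
y_demo = X_demo[:, 0] ** 2 - X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.2, size=200)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures()),
    ("lr", LinearRegression()),
])

degrees = [1, 2, 3]
# One row per degree, one column per CV fold.
train_scores, val_scores = validation_curve(
    model, X_demo, y_demo,
    param_name="poly__degree", param_range=degrees,
    cv=5, scoring="r2",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for deg, tr, va in zip(degrees, train_mean, val_mean):
    print(f"degree={deg}: train R2={tr:.3f}, validation R2={va:.3f}")
```

For a nested pipeline such as the one in this notebook, the parameter name would presumably follow the step names, e.g. `pre__num__poly2__degree`.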

Decision tree regression¶

Data preparation¶

Splitting into training and test data

In [ ]:
X = df.drop([_SIHTTUNNUS_],axis=1)
y = df[_SIHTTUNNUS_]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Decision tree regression model¶

A pipeline (Pipeline) combines several transformations and predictors into a single object.

Using numerical and categorical variables in the model¶

We separate the categorical and numerical variables using their data types. As seen earlier, the object dtype corresponds to the categorical columns. We use make_column_selector to select the respective columns.

In [ ]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)

numerical_columns
Out[ ]:
['EEtot',
 'METS',
 'Rf',
 'VT',
 'VE',
 'O2exp',
 'CO2exp',
 'FeO2',
 'FeCO2',
 'VO2.HR',
 'HR']
In [ ]:
categorical_columns
Out[ ]:
['gender', 'original_activity_labels']

Numerical and categorical data must be prepared for modelling differently: categorical values are replaced with indicator features (one-hot encoding), and numerical data are standardized/normalized. Scikit-learn provides the ColumnTransformer class, which splits the pipeline into two branches and passes specific columns to specific transformers. This makes it possible to combine both kinds of variables in one pipeline.

In [ ]:
categorical_preprocessor = OneHotEncoder(drop='first')

Preprocessor for the numerical features:

In [ ]:
numerical_preprocessor = StandardScaler()

Now we create the ColumnTransformer and associate the preprocessors with their columns:

In [ ]:
preprocessor = ColumnTransformer(
    [
        ("ctg", categorical_preprocessor, categorical_columns),
        ("num", numerical_preprocessor, numerical_columns),
    ]
)
tree_pipe = Pipeline([
    ('pre', preprocessor),
    ('tree', DecisionTreeRegressor(random_state=0))])
tree_pipe.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('pre',
                 ColumnTransformer(transformers=[('ctg',
                                                  OneHotEncoder(drop='first'),
                                                  ['gender',
                                                   'original_activity_labels']),
                                                 ('num', StandardScaler(),
                                                  ['EEtot', 'METS', 'Rf', 'VT',
                                                   'VE', 'O2exp', 'CO2exp',
                                                   'FeO2', 'FeCO2', 'VO2.HR',
                                                   'HR'])])),
                ('tree', DecisionTreeRegressor(random_state=0))])
In [ ]:
print('R2 on training data: %.3f' % tree_pipe.score(X_train, y_train))
print('R2 on test data: %.3f' % tree_pipe.score(X_test, y_test))
R2 on training data: 1.000
R2 on test data: 0.975

RMSE before the inverse transformation exp():

In [975]:
mse = mean_squared_error(y_train, tree_pipe.predict(X_train))
print(f"Decision tree RMSE on training data: {(np.sqrt(mse)):.3f}")

y_pred = tree_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Decision tree RMSE on test data: {(rmse):.3f}")
Decision tree RMSE on training data: 0.000
Decision tree RMSE on test data: 1.623

We apply the inverse transformation exp():

In [ ]:
print('R2 on training data: %.3f' % r2_score(np.exp(y_train),np.exp(tree_pipe.predict(X_train))))
print('R2 on test data: %.3f' % r2_score(np.exp(y_test),np.exp(tree_pipe.predict(X_test))))
R2 on training data: 1.000
R2 on test data: 0.778

RMSE after the inverse transformation exp():

In [ ]:
y_pred = np.exp(tree_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"Decision tree RMSE on test data: {(rmse):.3f}")
Decision tree RMSE on test data: 123540818094081495006943483507626838327296.000

Either the model is overfitted, or the exponential inverse transformation is not appropriate for the decision tree model on this dataset. An RMSE of the order of 1e41 suggests the latter: exp() is a valid inverse only if the target was log-transformed before training.
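
The following sketch, on synthetic data rather than the project's dataset, contrasts the two situations. In the first, the target is log-transformed for training via scikit-learn's `TransformedTargetRegressor`, so `np.exp` is the matching inverse. In the second, the model is trained on the raw target and `np.exp` is applied anyway, which inflates the error by orders of magnitude. All variable names and the tree depth are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(300, 1))
# Strictly positive, right-skewed target (assumption).
y_demo = np.exp(0.5 * X_demo[:, 0]) + rng.uniform(0, 0.5, size=300)

# Case 1: trained on log(y); predict() applies exp() automatically as the inverse.
log_model = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(max_depth=4, random_state=0),
    func=np.log,
    inverse_func=np.exp,
).fit(X_demo, y_demo)
rmse_matched = np.sqrt(mean_squared_error(y_demo, log_model.predict(X_demo)))

# Case 2: trained on y itself, yet exp() is applied to targets and predictions afterwards.
raw_model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
rmse_mismatched = np.sqrt(
    mean_squared_error(np.exp(y_demo), np.exp(raw_model.predict(X_demo)))
)

print(f"RMSE with matching log/exp pair: {rmse_matched:.3f}")
print(f"RMSE with exp() on an untransformed target: {rmse_mismatched:.3f}")
```

Whether the project's target was log-transformed earlier in the notebook cannot be verified from this section; if it was not, metrics should be reported on the original scale without `np.exp()`.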

The model's most important features:

In [ ]:
imp = pd.DataFrame(tree_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
Out[ ]:
Feature Importance
21 num__VE 0.9461
0 ctg__gender_male 0.0187
27 num__HR 0.0094
18 num__METS 0.0062
26 num__VO2.HR 0.0040
25 num__FeCO2 0.0039
23 num__CO2exp 0.0025
24 num__FeO2 0.0024
17 num__EEtot 0.0022
19 num__Rf 0.0021
22 num__O2exp 0.0009
20 num__VT 0.0006
16 ctg__original_activity_labels_nan 0.0002
2 ctg__original_activity_labels_lyingDownLeft 0.0001
12 ctg__original_activity_labels_walkingFast 0.0001
13 ctg__original_activity_labels_walkingNormal 0.0001
11 ctg__original_activity_labels_vacuumCleaning 0.0001
14 ctg__original_activity_labels_walkingSlow 0.0001
8 ctg__original_activity_labels_standing 0.0000
7 ctg__original_activity_labels_stakingShelves 0.0000
3 ctg__original_activity_labels_lyingDownRight 0.0000
1 ctg__original_activity_labels_dishwashing 0.0000
6 ctg__original_activity_labels_sittingSofa 0.0000
4 ctg__original_activity_labels_sittingChair 0.0000
9 ctg__original_activity_labels_step 0.0000
5 ctg__original_activity_labels_sittingCouch 0.0000
15 ctg__original_activity_labels_walkingStairsUp 0.0000
10 ctg__original_activity_labels_syncJumping 0.0000
In [ ]:
from sklearn.model_selection import GridSearchCV
In [ ]:
parameters={"tree__splitter":["best","random"],
            "tree__max_depth" : [1,3,5,7,9],
           "tree__min_samples_leaf":[1,2,3,4,5,6,7],
           "tree__max_features":["log2","sqrt",None],
           "tree__max_leaf_nodes":[None,10,20,30] }

gs_tree_pipe = GridSearchCV(estimator=tree_pipe, param_grid=parameters, cv=5, verbose=0)

gs_tree_pipe.fit(X_train, y_train)

gs_tree_pipe.best_params_
Out[ ]:
{'tree__max_depth': 9,
 'tree__max_features': None,
 'tree__max_leaf_nodes': None,
 'tree__min_samples_leaf': 7,
 'tree__splitter': 'best'}
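
Note that the selected `tree__max_depth` (9) and `tree__min_samples_leaf` (7) are both the largest values offered in the grid, so the optimum may lie outside the searched range. One way to check how flat the search surface is near the winner is to inspect `cv_results_`. The sketch below does this for a deliberately small grid on synthetic data; the grid values and variable names are assumptions, not the project's settings.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 9], "min_samples_leaf": [1, 7]},
    cv=5,
    scoring="r2",
).fit(X_demo, y_demo)

# One row per parameter combination, best combination first.
columns = ["param_max_depth", "param_min_samples_leaf",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

If the top rows differ by less than one `std_test_score`, the choice between them is within noise, and extending the grid beyond its current edge is worth considering.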

Model accuracy before applying the inverse transformation:

In [ ]:
print(f"R2 score on train: {gs_tree_pipe.score(X_train, y_train):.3f}")
print(f"R2 score on test: {gs_tree_pipe.score(X_test, y_test):.3f}")
R2 score on train: 0.984
R2 score on test: 0.978

RMSE before the inverse transformation exp():

In [974]:
mse = mean_squared_error(y_train, gs_tree_pipe.predict(X_train))
print(f"Tuned decision tree RMSE on training data: {(np.sqrt(mse)):.3f}")

y_pred = gs_tree_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Tuned decision tree RMSE on test data: {(rmse):.3f}")
Tuned decision tree RMSE on training data: 1.330
Tuned decision tree RMSE on test data: 1.537

Model accuracy after applying the inverse transformation exp():

In [ ]:
print(f"R2 score on train: {r2_score(np.exp(y_train),np.exp(gs_tree_pipe.predict(X_train))):.3f}")
print(f"R2 score on test: {r2_score(np.exp(y_test),np.exp(gs_tree_pipe.predict(X_test))):.3f}")
R2 score on train: 0.942
R2 score on test: 0.908

RMSE after the inverse transformation exp():

In [ ]:
y_pred = np.exp(gs_tree_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"Tuned decision tree RMSE on test data: {(rmse):.3f}")
Tuned decision tree RMSE on test data: 79616768612770933240036264162054455164928.000

As with the untuned tree, either the model is overfitted or the exponential inverse transformation is not appropriate for the decision tree model on this dataset.

Random Forest regression¶

Building the model on the training data

We create a Pipeline that combines the data preprocessing with a Random Forest regression model and then train it on the given training data.

In [ ]:
rf_pipe = Pipeline([
    ('pre', preprocessor),
    ('rf', RandomForestRegressor(random_state=0))])

rf_pipe.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('pre',
                 ColumnTransformer(transformers=[('ctg',
                                                  OneHotEncoder(drop='first'),
                                                  ['gender',
                                                   'original_activity_labels']),
                                                 ('num', StandardScaler(),
                                                  ['EEtot', 'METS', 'Rf', 'VT',
                                                   'VE', 'O2exp', 'CO2exp',
                                                   'FeO2', 'FeCO2', 'VO2.HR',
                                                   'HR'])])),
                ('rf', RandomForestRegressor(random_state=0))])
In [ ]:
print('R2 on training data: %.3f' % rf_pipe.score(X_train, y_train))
print('R2 on test data: %.3f' % rf_pipe.score(X_test, y_test))
R2 on training data: 0.998
R2 on test data: 0.989

RMSE:

In [971]:
mse = mean_squared_error(y_train, rf_pipe.predict(X_train))
print(f"Random Forest RMSE on training data: {(np.sqrt(mse)):.3f}")

y_pred = rf_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Random Forest RMSE on test data: {(rmse):.3f}")
Random Forest RMSE on training data: 0.457
Random Forest RMSE on test data: 1.191

We apply the inverse transformation exp() in an attempt to improve the model.

In [932]:
print('R2 on training data: %.3f' % r2_score(np.exp(y_train),np.exp(rf_pipe.predict(X_train))))
print('R2 on test data: %.3f' % r2_score(np.exp(y_test),np.exp(rf_pipe.predict(X_test))))
R2 on training data: 0.987
R2 on test data: 0.896

The model's most important features:

In [933]:
imp = pd.DataFrame(rf_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
Out[933]:
Feature Importance
21 num__VE 0.9461
0 ctg__gender_male 0.0186
27 num__HR 0.0076
18 num__METS 0.0063
26 num__VO2.HR 0.0055
25 num__FeCO2 0.0036
19 num__Rf 0.0029
24 num__FeO2 0.0026
17 num__EEtot 0.0022
23 num__CO2exp 0.0019
22 num__O2exp 0.0010
20 num__VT 0.0008
16 ctg__original_activity_labels_nan 0.0002
11 ctg__original_activity_labels_vacuumCleaning 0.0001
2 ctg__original_activity_labels_lyingDownLeft 0.0001
13 ctg__original_activity_labels_walkingNormal 0.0001
12 ctg__original_activity_labels_walkingFast 0.0001
14 ctg__original_activity_labels_walkingSlow 0.0001
8 ctg__original_activity_labels_standing 0.0001
7 ctg__original_activity_labels_stakingShelves 0.0001
1 ctg__original_activity_labels_dishwashing 0.0000
4 ctg__original_activity_labels_sittingChair 0.0000
3 ctg__original_activity_labels_lyingDownRight 0.0000
6 ctg__original_activity_labels_sittingSofa 0.0000
9 ctg__original_activity_labels_step 0.0000
5 ctg__original_activity_labels_sittingCouch 0.0000
15 ctg__original_activity_labels_walkingStairsUp 0.0000
10 ctg__original_activity_labels_syncJumping 0.0000

Using GridSearchCV makes it possible to optimize the hyperparameters of the Random Forest model and is expected to give higher accuracy. We tune the parameters with GridSearchCV:

In [934]:
# param_grid_rf = {
#     'rf__n_estimators': [10, 50, 100, 500, 1000],
#     'rf__max_features': ['log2', 'sqrt', 0.8,1]
# }

param_grid_rf = {
    'rf__n_estimators': [10, 50, 100],
    'rf__max_features': ['log2', 'sqrt', 0.8,1]
}

gs_rf_pipe = GridSearchCV(estimator=rf_pipe, param_grid=param_grid_rf, cv=5, verbose=0)

gs_rf_pipe.fit(X_train, y_train)

gs_rf_pipe.best_params_
Out[934]:
{'rf__max_features': 0.8, 'rf__n_estimators': 100}

Model accuracy (R2 on training and test data) before applying the inverse transformation:

In [935]:
print(gs_rf_pipe.score(X_train, y_train))
print(gs_rf_pipe.score(X_test, y_test))
0.9985355652170951
0.9894740695731525

Model RMSE:

In [970]:
mse = mean_squared_error(y_train, gs_rf_pipe.predict(X_train))
print(f"Tuned Random Forest RMSE on training data: {(np.sqrt(mse)):.3f}")

y_pred = gs_rf_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Tuned Random Forest RMSE on test data: {(rmse):.3f}")
Tuned Random Forest RMSE on training data: 0.400
Tuned Random Forest RMSE on test data: 1.058

Model accuracy after applying the inverse transformation exp():

In [936]:
print(f"R2 score on train: {r2_score(np.exp(y_train),np.exp(gs_rf_pipe.predict(X_train))):.3f}")
print(f"R2 score on test: {r2_score(np.exp(y_test),np.exp(gs_rf_pipe.predict(X_test))):.3f}")
R2 score on train: 0.986
R2 score on test: 0.910

The model's explanatory power is almost the same.

RMSE after GridSearchCV tuning and after the inverse transformation exp():

In [969]:
y_pred = np.exp(gs_rf_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"Tuned Random Forest RMSE on test data: {(rmse):.3f}")
Tuned Random Forest RMSE on test data: 256069566736770598900636151744551404437504.000

Either the model is overfitted, or the exponential inverse transformation is not appropriate for the Random Forest model on this dataset.

Model features:

In [938]:
preprocessor.get_feature_names_out()
Out[938]:
array(['ctg__gender_male', 'ctg__original_activity_labels_dishwashing',
       'ctg__original_activity_labels_lyingDownLeft',
       'ctg__original_activity_labels_lyingDownRight',
       'ctg__original_activity_labels_sittingChair',
       'ctg__original_activity_labels_sittingCouch',
       'ctg__original_activity_labels_sittingSofa',
       'ctg__original_activity_labels_stakingShelves',
       'ctg__original_activity_labels_standing',
       'ctg__original_activity_labels_step',
       'ctg__original_activity_labels_syncJumping',
       'ctg__original_activity_labels_vacuumCleaning',
       'ctg__original_activity_labels_walkingFast',
       'ctg__original_activity_labels_walkingNormal',
       'ctg__original_activity_labels_walkingSlow',
       'ctg__original_activity_labels_walkingStairsUp',
       'ctg__original_activity_labels_nan', 'num__EEtot', 'num__METS',
       'num__Rf', 'num__VT', 'num__VE', 'num__O2exp', 'num__CO2exp',
       'num__FeO2', 'num__FeCO2', 'num__VO2.HR', 'num__HR'], dtype=object)

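Feature importances of the refitted model (note that it is refit with max_features='log2', not with the max_features=0.8 selected by the grid search above):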

In [939]:
# rf_pipe = Pipeline([
#     ('pre', preprocessor),
#     ('rf',RandomForestRegressor(max_features='log2', n_estimators= 1000))])

rf_pipe = Pipeline([
    ('pre', preprocessor),
    ('rf',RandomForestRegressor(max_features='log2', n_estimators= 100))])
rf_pipe.fit(X_train, y_train)
Out[939]:
Pipeline(steps=[('pre',
                 ColumnTransformer(transformers=[('ctg',
                                                  OneHotEncoder(drop='first'),
                                                  ['gender',
                                                   'original_activity_labels']),
                                                 ('num', StandardScaler(),
                                                  ['EEtot', 'METS', 'Rf', 'VT',
                                                   'VE', 'O2exp', 'CO2exp',
                                                   'FeO2', 'FeCO2', 'VO2.HR',
                                                   'HR'])])),
                ('rf', RandomForestRegressor(max_features='log2'))])
In [940]:
imp = pd.DataFrame(rf_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
Out[940]:
Feature Importance
21 num__VE 0.2356
18 num__METS 0.1878
22 num__O2exp 0.1071
20 num__VT 0.0865
19 num__Rf 0.0792
26 num__VO2.HR 0.0774
27 num__HR 0.0670
23 num__CO2exp 0.0621
17 num__EEtot 0.0364
24 num__FeO2 0.0198
25 num__FeCO2 0.0174
0 ctg__gender_male 0.0077
12 ctg__original_activity_labels_walkingFast 0.0047
16 ctg__original_activity_labels_nan 0.0027
5 ctg__original_activity_labels_sittingCouch 0.0016
13 ctg__original_activity_labels_walkingNormal 0.0015
3 ctg__original_activity_labels_lyingDownRight 0.0012
11 ctg__original_activity_labels_vacuumCleaning 0.0008
6 ctg__original_activity_labels_sittingSofa 0.0007
4 ctg__original_activity_labels_sittingChair 0.0007
14 ctg__original_activity_labels_walkingSlow 0.0006
8 ctg__original_activity_labels_standing 0.0005
1 ctg__original_activity_labels_dishwashing 0.0004
2 ctg__original_activity_labels_lyingDownLeft 0.0003
7 ctg__original_activity_labels_stakingShelves 0.0003
9 ctg__original_activity_labels_step 0.0001
15 ctg__original_activity_labels_walkingStairsUp 0.0000
10 ctg__original_activity_labels_syncJumping 0.0000

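After refitting with max_features='log2', the feature importance values have changed: the importance of num__VE has dropped about fourfold (from 0.946 to 0.236). Because this setting differs from the grid-search result (max_features=0.8), the change reflects the max_features setting rather than the tuned model. The most important features:

  • 21 num__VE
  • 18 num__METS
  • 22 num__O2exp
  • 20 num__VT
  • 19 num__Rf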
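
The shift in importances is plausibly a side effect of how impurity-based importance interacts with `max_features`: when each split may only choose from a small random subset of features, correlated features share the credit. The sketch below illustrates this on synthetic data and also computes `permutation_importance` on held-out data as a complementary view. The three-column dataset and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
X_demo = np.column_stack([
    signal,                                # column 0: the informative feature
    signal + 0.5 * rng.normal(size=500),   # column 1: noisy, correlated copy
    rng.normal(size=500),                  # column 2: pure noise
])
y_demo = 3.0 * signal + rng.normal(scale=0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

impurity = {}
permutation = {}
for max_features in (None, "log2"):
    rf = RandomForestRegressor(n_estimators=50, max_features=max_features, random_state=0)
    rf.fit(X_tr, y_tr)
    # Permutation importance: drop in test R2 when one column is shuffled.
    perm = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
    impurity[str(max_features)] = rf.feature_importances_
    permutation[str(max_features)] = perm.importances_mean
    print(f"max_features={max_features}: impurity={rf.feature_importances_.round(3)}, "
          f"permutation={perm.importances_mean.round(3)}")
```

With all features available at every split (`None`), column 0 typically dominates; with `'log2'` its impurity-based share is spread over the correlated column as well.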

Analysis of results¶

Comparison of results by R2 and RMSE (training and test data)

| Method | R2 train | R2 test | RMSE train | RMSE test |
|---|---|---|---|---|
| Linear regression model with numerical features only | 0.970 | 0.970 | 1.807 | 1.786 |
| Linear regression model with categorical features | 0.974 | 0.974 | 1.671 | 1.653 |
| Polynomial regression | 0.985 | 0.984 | 1.300 | 1.290 |
| Decision tree regression | 1.000 | 0.975 | 0.000 | 1.623 |
| Decision tree regression tuned with GridSearchCV | 0.984 | 0.978 | 1.330 | 1.537 |
| Random Forest regression | 0.998 | 0.989 | 0.457 | 1.191 |
| Random Forest regression tuned with GridSearchCV | 0.999 | 0.989 | 0.400 | 1.058 |
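
A comparison table like the one above can be assembled programmatically, which reduces the risk of copying a value into the wrong column. The following is a minimal sketch on synthetic data; the helper `summarize` and the three example models are hypothetical and do not reproduce the project's numbers.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def summarize(name, model, X_tr, y_tr, X_te, y_te):
    """Hypothetical helper: one row of R2 / RMSE values for an already fitted model."""
    pred_tr, pred_te = model.predict(X_tr), model.predict(X_te)
    return {
        "Method": name,
        "R2 train": r2_score(y_tr, pred_tr),
        "R2 test": r2_score(y_te, pred_te),
        "RMSE train": np.sqrt(mean_squared_error(y_tr, pred_tr)),
        "RMSE test": np.sqrt(mean_squared_error(y_te, pred_te)),
    }


# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
rows = [summarize(name, model.fit(X_tr, y_tr), X_tr, y_tr, X_te, y_te)
        for name, model in models.items()]
table = pd.DataFrame(rows).set_index("Method").round(3)
print(table)
```

As in the project's results, the unpruned decision tree reproduces its training data exactly, so its training RMSE is 0 while the test RMSE is not.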

Kokkuvõte¶

Töö käigus tehtud andmeanalüüsi jaoks on kasutatud erinevad meetodid: Lineaarne Regressioon, Polünomiaalne Regressioon, Otsustuspuu Regressioon ja Random Forest Regressioon.

Hoolimata nende mudelite kirjeldusvõime sarnasustest on leitud erinevused tulemustes:

  • Random Forest regressioonimudel saavutas kõrgeima täpsuse testandmetel 99.8% ja RMSE 1.058, mis on kõigest lähedam nullile.
  • Lineaarregressiooni ja Random Foresti mudelitel täpsuse vahe ligi 1.5%. Lineaarregressioon eeldab lineaarset seost omaduste ja sihttunnuse vahel, samas kui Random Forest mudel teeb mudelite ansambli, mis kasutavad mitmeid otsustuspuusid.
  • The untuned decision tree regression model showed a suspiciously perfect fit on the training data (R2 of 1.000 and RMSE of 0.000), which points to overfitting; on the test data its results are more realistic (R2 of 0.975 and RMSE of 1.623).
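  • The polynomial regression model and the GridSearchCV-tuned decision tree model gave similar results on the training data (R2 of 0.985 and 0.984), although polynomial regression did better on the test data (R2 of 0.984 versus 0.978).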
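The perfect training score of an unrestricted decision tree can be reproduced on toy data. The sketch below uses synthetic data and an arbitrarily chosen `min_samples_leaf=10` as the restricted variant; the hyperparameters actually selected by GridSearchCV in this project are not assumed here.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data with noise (hypothetical, not the project dataset).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# An unrestricted tree grows until every training sample sits in its own leaf.
full_tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
# Requiring at least 10 samples per leaf stops the tree from memorising noise.
small_tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=1).fit(
    X_train, y_train
)

r2_full_train = r2_score(y_train, full_tree.predict(X_train))
r2_full_test = r2_score(y_test, full_tree.predict(X_test))
r2_small_train = r2_score(y_train, small_tree.predict(X_train))
r2_small_test = r2_score(y_test, small_tree.predict(X_test))

print(f"unrestricted tree:   train R2 = {r2_full_train:.3f}, test R2 = {r2_full_test:.3f}")
print(f"min_samples_leaf=10: train R2 = {r2_small_train:.3f}, test R2 = {r2_small_test:.3f}")
```

A training score of 1.000 therefore says little about model quality on its own; the test score, or a cross-validated score, is the figure to compare across models.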

Of the models compared, Random Forest is the best suited to predicting the target feature BR (breath rate) in this dataset. The method may adapt better to data that contain more complex and varied patterns, or patterns that a linear model cannot easily capture.

Decision trees can model non-linear relationships, but a single tree may struggle on more complex datasets, where it tends to overfit the training data.

Random Forest models are usually more complex than single decision tree models because they consist of many decision trees. This lets Random Forest models adapt to more complex datasets and can lead to better performance than an individual decision tree.
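How a forest combines its trees can be checked directly: for `RandomForestRegressor`, the ensemble prediction is the mean of the individual tree predictions. The sketch below demonstrates this on synthetic data; the data and the number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data with noise (hypothetical, not the project dataset).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=2).fit(X, y)
X_new = rng.uniform(-3, 3, size=(5, 2))

# One row of predictions per fitted tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print("forest prediction:", forest.predict(X_new).round(3))
print("mean over trees:  ", manual_mean.round(3))
print("spread over trees:", per_tree.std(axis=0).round(3))
```

Individual trees disagree with each other, as the spread shows, and averaging them smooths out part of that variance, which is one common explanation for the forest generalising better than a single tree.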

The similar accuracy of linear regression and polynomial regression suggests that the two models capture largely the same underlying patterns in the data.

The Fitted vs. Residuals scatter plot for linear regression showed that the residuals cluster into a distinct shape, which suggests that the model may not adequately capture some underlying structure in the dataset. There are several possible explanations:

  • Non-linearity: a distinct shape in the residuals may indicate a non-linear relationship between the predictors and the target variable. The data analysis found few linear relationships between the features, so the linear model may not be fully accurate.
  • Most variables are not related to each other.
  • Multicollinearity, which looks the more likely explanation given the correlation matrix of this dataset.

The Residuals vs. Fitted plot for polynomial regression shows more scatter. By explanatory power this model ranks right behind the Random Forest models.
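A shaped residual pattern of the kind described above can also be detected numerically. The sketch below fits a straight line and a quadratic to synthetic curved data and measures how strongly the residuals still depend on the squared, centred fitted values; the data, the helper name `residual_curvature` and this particular diagnostic are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data with a true quadratic relationship (hypothetical).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 500)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=1.0, size=500)


def residual_curvature(degree):
    """Correlation between residuals and squared, centred fitted values.

    A value far from zero means the residuals bend with the fitted values,
    which is the U-shaped pattern seen in a Fitted vs. Residuals plot.
    """
    coeffs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coeffs, x)
    residuals = y - fitted
    centred_sq = (fitted - fitted.mean()) ** 2
    return float(np.corrcoef(centred_sq, residuals)[0, 1])


curv_linear = residual_curvature(1)
curv_quadratic = residual_curvature(2)
print(f"straight-line fit: {curv_linear:.2f}")
print(f"quadratic fit:     {curv_quadratic:.2f}")
```

On such data the straight-line fit leaves a strong curved pattern in the residuals, while adding the quadratic term removes most of it, which matches the more evenly scattered residuals reported for polynomial regression.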

Overall, the analysis shows that certain models, namely Random Forest and the model with interaction terms (polynomial regression), were more accurate than the others.